Omorfi–Open morphology of Finnish

Omorfi is a free and open source project containing various tools and data for handling Finnish texts in a linguistically motivated manner. The main components of this repository are:

  1. a lexical database containing hundreds of thousands of words (cf. lexical statistics)
  2. a collection of scripts to convert the lexical database into formats used by upstream NLP tools (cf. lexical processing)
  3. an autotools setup to build and install (or package, or deploy) the scripts, the database, and simple APIs / convenience processing tools
  4. a collection of relatively simple APIs for a selection of languages and scripts to apply the NLP tools and access the database

The formats we produce are (links to free open source implementations included):

  1. lexc, as processed by HFST and foma, to be used for morphological analysis, stemming, segmentation, natural language generation, hyphenation and as a basis for language models
  2. apertium, to be used for machine translation
  3. voikko, to be used for spell-checking and correction
  4. kotus-sanalista, lexical markup framework, tab-separated values, etc., for long- and short-term storage and as intermediate formats
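
To give a rough idea of what converting the lexical database into one of these formats involves, here is a minimal sketch that turns tab-separated lexical data into a lexc lexicon fragment. It is not the conversion script shipped with omorfi: the two-column layout (`lemma`, `continuation`), the continuation class names and the functions `escape_lexc` and `tsv_to_lexc` are all assumptions made up for illustration, and the set of escaped characters is a conservative guess rather than a complete list.

```python
import csv
import io

# Hypothetical two-column TSV: lemma <TAB> continuation class.
# The real omorfi database has its own columns; these are invented here.
SAMPLE_TSV = "lemma\tcontinuation\nkala\tNOUN_KALA\ntalo\tNOUN_TALO\n"

# Characters escaped with % in lexc entries (conservative, probably incomplete).
LEXC_SPECIAL = set("%:;!<>0 ")


def escape_lexc(text):
    """Prefix each character that lexc treats specially with %."""
    return "".join("%" + ch if ch in LEXC_SPECIAL else ch for ch in text)


def tsv_to_lexc(tsv_text, lexicon_name="Root"):
    """Build one lexc LEXICON block from TSV text with a header row."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = ["LEXICON " + lexicon_name]
    for row in reader:
        # A lexc entry has the form: <string> <continuation lexicon> ;
        lines.append(
            "{} {} ;".format(escape_lexc(row["lemma"]), row["continuation"])
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(tsv_to_lexc(SAMPLE_TSV), end="")
```

The result is only a fragment: a file that HFST or foma can compile would also need the continuation lexicons (here `NOUN_KALA` and `NOUN_TALO`) to be defined, plus any multi-character symbols declared.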

Documentation

These web pages are the main source of up-to-date documentation. You should find a list of all pages on the left.

Everyone should read at least the versioning information and the README:

  1. Versions and download info
  2. README

If you wish to use omorfi in a serious application, you have probably found out from the README that the Python or Java API is the way to go:

  1. Python API
  2. Java API

There’s a mass of automatically generated statistics from each version of omorfi:

  1. Lexical statistics
  2. Coverage tests
  3. Missing word-forms by corpora
  4. Faithfulness tests
  5. Speed
  6. Automata sizes

The design principles of the morphological analysis have been changed a dozen times to accommodate various applications:

  1. Analysis tags
  2. Design “principles” for tags
  3. Internal keys and codes

More internal documentation:

  1. Directory layout
  2. Database structure

And more…

Contact

If you want to discuss omorfi in Finnish or English, the IRC channels #omorfi and #hfst on Freenode are available for immediate chats (Freenode webchat here). The Google Groups discussion list omorfi-devel@groups.google.com (Google Groups web interface here) can also be used; it may require a subscription but is very low volume. If, for some reason, you wish to discuss in private, the authors’ private emails can be used, but please prefer the public channels for general usage questions and the like, as an archive of frequently asked questions benefits everyone. For bug reports, use the issue functionality on this site, or even pull requests.

Alternatives to omorfi

If omorfi doesn’t suit your needs, you may want to try other similar products: suomi-malaga of voikko fame is another morphological analyser of Finnish. Grammatical Framework, which is written in Haskell, also has NLP components for Finnish.

If you want to use commercial products, there are surely some available somewhere.