Database processing
This page documents how to hack on the scripts in src/ that handle new word classification and the lexical database. The current version is just an unstructured collection of hacks. The data is stored in TSV files, roughly in the form understood by Python’s csv.writer and csv.DictWriter classes.
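As a minimal sketch of that storage format, the following writes a row with csv.DictWriter and reads it back with csv.DictReader. The tab delimiter, the selection of columns and the sample row are assumptions for illustration; the actual scripts may use different dialect options.

```python
import csv
import io

# Column names come from the field list on this page; the selection is illustrative.
FIELDS = ["lemma", "homonym", "new_para", "origin"]


def write_rows(handle, rows):
    """Write dict rows as tab-separated values with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


def read_rows(handle):
    """Read tab-separated rows back into a list of dicts (all values are strings)."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical sample row; PARA_X is a made-up placeholder, not a real class name.
sample = [{"lemma": "talo", "homonym": "1", "new_para": "PARA_X", "origin": "omorfi"}]
buffer = io.StringIO()
write_rows(buffer, sample)
buffer.seek(0)
round_trip = read_rows(buffer)
```

Note that everything comes back as a string, so numeric fields such as homonym need converting after reading.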
(I’d like to rewrite this part in SQLite or something else that maintains consistency of data on updates; stay tuned for updates in OmorfiLexicalDatabase. I would appreciate help from someone who actually likes databases: I studied this stuff a decade ago and know how to do it, but I am not awfully excited about it anymore.)
Introduction
A lot of omorfi’s functionality is based on large databases of lexical data, that is, information about words. Data from different sources carries different kinds of information and often lacks explicit knowledge of the word’s morphophonological structure, so we have a ton of Python scripts to figure this out semi-automatically. This process is also called guessing. Ideally, all data sources would have this data available and verified by humans so that we wouldn’t have to guess, but new words come up all the time, and this set of scripts exists to help people with classification. The data is set up in a Python dict() structure.
The whole process can be summarised by this simple flow-chart:
The files in src/lexemes/*.tsv are joined column by column, using lemma + homonym as the key. The files in src/paradigms/*.tsv are joined en masse based on the new_para class.
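The two joins can be pictured with the following sketch. It is not the actual script code: the function names, the in-memory sample files and the values PARA_X, BUILDING and back are hypothetical, and only the key columns (lemma, homonym, new_para) are taken from this page.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated file with a header line into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def join_lexemes(tables):
    """Merge rows from several tables column by column on (lemma, homonym)."""
    joined = {}
    for rows in tables:
        for row in rows:
            key = (row["lemma"], row["homonym"])
            joined.setdefault(key, {}).update(row)
    return joined


def join_paradigms(lexemes, paradigm_rows):
    """Copy the columns of the matching new_para row into every lexeme."""
    by_class = {row["new_para"]: row for row in paradigm_rows}
    for entry in lexemes.values():
        entry.update(by_class.get(entry.get("new_para"), {}))
    return lexemes


# Hypothetical stand-ins for files under src/lexemes/ and src/paradigms/.
lexemes_tsv = "lemma\thomonym\tnew_para\torigin\ntalo\t1\tPARA_X\tomorfi\n"
attributes_tsv = "lemma\thomonym\tsem\ntalo\t1\tBUILDING\n"
paradigms_tsv = "new_para\tpos\tharmony\nPARA_X\tN\tback\n"

lexemes = join_lexemes([read_tsv(io.StringIO(lexemes_tsv)),
                        read_tsv(io.StringIO(attributes_tsv))])
lexemes = join_paradigms(lexemes, read_tsv(io.StringIO(paradigms_tsv)))
```

In this sketch a lexeme whose new_para has no matching paradigm row is simply left without the paradigm columns; a real pipeline would probably want to report that as an error.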
Details
The map that we collect in Python contains the following fields. These main fields can form unique keys:
- lemma is the word’s id / the dictionary form.
- new_para is our full classification of the word.
- pos is morphological part-of-speech.
- homonym is a number denoting different lexemes with the same lemma.
These are used for compatibility with official dictionaries:
- kotus_tn is the combined classification in the Kotus (RILF) dictionaries.
- kotus_av is the gradation classification from the same dictionaries.
These are used to tweak morphological features:
- plurale_tantum determines if singular forms are used.
- possessive determines if possessives are used.
- clitics determines if clitics are used.
These are used in applications, for semantics, pragmatics, whatever:
- is_proper determines if word is proper noun.
- proper_noun_class determines the proper noun’s class.
- particle is a string of additional particle analyses.
- sem determines optional semantic classes.
- noun_class is a string of additional noun analyses.
- adjective_class is a string of additional adjective analyses.
- numeral_class is a string of additional numeral analyses.
- pronoun is a string of additional pronoun analyses.
- style determines usage limitations of word.
- origin denotes the data source for the lexeme.
These are used by implementations for morphographemics, etc.:
- stub is the prefix of the word that does not alternate.
- gradestem is the stub of the word with gradation morphophonemes.
- twolstem is the stem of the word with alternations marked as special symbols.
- grade_dir determines direction of gradation.
- harmony determines the variant of vowel harmony.
- is_suffix determines if word is bound morpheme after host root.
- is_prefix determines if word is bound morpheme before host root.
- stem_diphthong determines stem’s diphthong for diphthong variation.
- stem_vowel determines stem vowel for lengthening and copying.
- extra_e determines whether stem ends in e or consonant.
- extra_i determines whether stem ends in i or consonant.
- pronunciation holds pronunciation information needed in guessing other features.
- boundaries marks word part boundaries for compounds.
- bracketstub is a combination of stub and stem used to generate complex readings for certain analysis styles.
Lemma
The lemma is not fully unique, but almost. It is required for guessing; we cannot do anything without it. It is encoded as a string.
New para
Our classification of the word, which contains all the information required to inflect it properly in omorfi systems. Not guessable. This classification is detailed in other wiki pages. It is encoded as a list.
Note that this encodes all the information in the other variables, so it can be expanded into all of them, or the other variables can be used to guess this one.
POS
The POS used by omorfi is strictly limited to morphological features that can be seen from the inflection of the word. Not guessable. It works like this: nouns inflect in case/number forms; adjectives have comparative derivation on top of that; verbs inflect in tense/mood and person forms, among others; particles do not inflect. This is encoded as a string in ['N', 'A', 'V', 'P'].
Homonym
Homonym numbering is used for separating words with the same dictionary form from each other. Not guessable. A basis for differentiation needs to be evidenced by real-world use: two separate inflectional paradigms are acceptable grounds for homonym numbering, although one may consider adding a new combined paradigm if the semantics are clearly indistinct. Evidence for different semantics can be e.g. translation: if two words have systematically different translations, it is a good reason to believe they are homonyms. Should be encoded as a positive int running number from 1; numbers shall be given in the order the database is sorted.
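The numbering rule (a running number from 1 per lemma, assigned in database order) could be sketched as follows. This only illustrates the encoding: deciding which entries are homonyms in the first place still needs a human. The function name and the placeholder classes PARA_A, PARA_B and PARA_X are hypothetical, and the rows are assumed to be already sorted.

```python
from collections import defaultdict


def number_homonyms(rows):
    """Assign homonym numbers 1, 2, ... per lemma, in the order rows are given."""
    counters = defaultdict(int)
    numbered = []
    for row in rows:
        counters[row["lemma"]] += 1
        numbered.append({**row, "homonym": counters[row["lemma"]]})
    return numbered


# "kuusi" is both "spruce" and "six" in Finnish; the new_para values are placeholders.
rows = [
    {"lemma": "kuusi", "new_para": "PARA_A"},
    {"lemma": "kuusi", "new_para": "PARA_B"},
    {"lemma": "talo", "new_para": "PARA_X"},
]
numbered = number_homonyms(rows)
```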
Kotus TN
An official dictionary classification. Not guessable. We can infer a lot of useful information from it, so it is possible to donate us databases containing only lemma and kotus_tn, and we can guess the rest. Should be a string encoding an int in range(1,102).
Kotus AV
An official dictionary classification. Not guessable. It removes the need to guess gradation beyond one specific pair of letters. Should be a string in [False, '0ABCDEFGHIJKLMNOPT'].
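The value ranges given for pos, kotus_tn and kotus_av can be checked with a few small predicates. This is a sketch, not code from the scripts: the function names are hypothetical, and the reading of kotus_av as "either False or a single character from '0ABCDEFGHIJKLMNOPT'" is an assumption about what the notation above means.

```python
POS_VALUES = {"N", "A", "V", "P"}
KOTUS_AV_LETTERS = "0ABCDEFGHIJKLMNOPT"


def valid_pos(value):
    """pos must be one of the four morphological classes."""
    return value in POS_VALUES


def valid_kotus_tn(value):
    """kotus_tn is a string encoding an int in range(1, 102), i.e. 1..101."""
    return (isinstance(value, str) and value.isascii() and value.isdigit()
            and int(value) in range(1, 102))


def valid_kotus_av(value):
    """kotus_av is False or (assumed) a single gradation class character."""
    if value is False:
        return True
    return isinstance(value, str) and len(value) == 1 and value in KOTUS_AV_LETTERS
```

Checks like these are cheap to run over a donated database before any guessing starts.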
Plurale tantum
Determines whether nominal is allowed to have singular forms. Can be guessed for all nominals, when dictionary form is plural and differs significantly from singular form.
Possessive
Determines whether partially inflecting word can take up possessives or not. Not guessable in general case.
Clitic
Determines whether partially inflecting or non-inflecting word can take up clitics. Not guessable in general case.
Proper?
Determines if noun is a proper noun. Mostly guessable from uppercased lemma.
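A minimal sketch of the uppercase heuristic, assuming the lemma is stored with its dictionary capitalisation; the function name is hypothetical.

```python
def guess_is_proper(lemma):
    """Guess proper-noun status from an uppercase first character."""
    return lemma[:1].isupper()
```

The guess is only "mostly" right, so a value given explicitly in the data should take precedence over it.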
Proper class
A semantic class for proper noun, not guessable at all, comes from other systems.
Style
Any arbitrary pragmatic usage limitation for word. Not guessable at all, comes from other systems.
Stub
The stub is the part of the word that does not undergo any alternations, and is thus a good starting point for many practical implementations of morphology. Fully guessable.
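How the scripts derive the stub is not described here. One way to picture the definition is as the longest common prefix of a word’s inflected forms, as in this sketch; the function name and the hand-picked forms are illustrative assumptions, and the real guesser presumably works from the paradigm instead.

```python
import os


def guess_stub(forms):
    """Return the longest prefix shared by all given word forms."""
    # os.path.commonprefix compares character by character, so it works on any strings.
    return os.path.commonprefix(list(forms))


# "katu" (street) has t~d gradation, so only "ka" stays constant.
katu_stub = guess_stub(["katu", "kadun", "katua", "kaduilla"])
# "talo" (house) does not alternate in these forms.
talo_stub = guess_stub(["talo", "talon", "taloa", "taloissa"])
```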
Gradestem
The original omorfi implementation used stems with gradation marked, but handled lots of other variation in lexc. This is the legacy stem for that and is fully guessable.
Twolstem
This stem could use twol rules for all variations. Fully guessable.
Gradation direction
Gradation can be split into two cases depending on which grade appears in the lemma form. Guessable in most cases.
Harmony
The suffixes depend on the vowel frontness of the word. Fully guessable most of the time; a few words need to override or relax this.
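A simplified sketch of such a guess: scan the lemma from the end and let the last back (a, o, u) or front (ä, ö, y) vowel decide, treating words with only the neutral vowels e and i as front. The function name and the return values "back" and "front" are hypothetical labels, and the exceptions mentioned above are not handled.

```python
BACK_VOWELS = set("aouAOU")
FRONT_VOWELS = set("äöyÄÖY")


def guess_harmony(lemma):
    """Guess vowel harmony from the last non-neutral vowel of the lemma."""
    for char in reversed(lemma):
        if char in BACK_VOWELS:
            return "back"
        if char in FRONT_VOWELS:
            return "front"
    # Only neutral vowels (e, i) or no vowels at all: default to front.
    return "front"
```

Scanning from the end means that in a compound such as kesäloma the final part decides, which is also why explicit word boundaries help for harder cases.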
Suffix? Prefix?
Some words in the dictionary do not appear without compound parts. This is most typically marked with a hyphen in the lemma and is thus fully guessable.
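A sketch of the hyphen heuristic, assuming that a leading hyphen marks a suffix (bound after the host) and a trailing hyphen marks a prefix (bound before the host). The function name and the example lemmas are illustrative.

```python
def guess_affix_flags(lemma):
    """Guess is_suffix / is_prefix from a hyphen at either end of the lemma."""
    has_body = len(lemma) > 1  # a bare "-" is not an affix
    return {
        "is_suffix": has_body and lemma.startswith("-"),
        "is_prefix": has_body and lemma.endswith("-"),
    }
```

For example, "-lainen" would be flagged as a suffix and "epä-" as a prefix, while "talo" gets neither flag.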
Stem diphthong
Determines the variant for uo/yö/ie words. Was needed for a few Kotus classes that nastily collapse everything together. Fully guessable.
Stem vowel
Determines vowel of illative forms, if needed. Guessable from last vowel of lemma, unless the pronunciation differs from the writing (in foreign loans). Guessable in most cases.
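A sketch of the last-vowel guess, with an optional pronunciation string that takes precedence when the spelling is misleading. The function name and the way the pronunciation is passed in are assumptions, not the scripts’ actual interface.

```python
VOWELS = set("aeiouyäöå")


def guess_stem_vowel(lemma, pronunciation=None):
    """Return the last vowel of the pronunciation if given, else of the lemma."""
    source = pronunciation if pronunciation else lemma
    for char in reversed(source.lower()):
        if char in VOWELS:
            return char
    return None  # no vowel found, nothing to guess
```

For a native word like talo the spelling is enough, while a loan whose written form ends differently from how it is pronounced needs the pronunciation field to get the right vowel.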
Extra E, I
Determines the variant of the singular nominative. Was needed to overcome Kotus classes 5, 6 and 49 collapsing all sorts of lemmas in one place. Fully guessable.
Pronunciation
Pronunciation information from the stem or differing pre-defined pronunciation is stored here for guessing other features (vowel harmony).
Boundaries
For compounds, the lemma with word part boundaries is given here to help determine the vowel harmony correctly.
Bracketstub
For analysis styles like FTB 3.1, the lemmas for compound-initial words are different from compound-final ones. To accommodate this, a complex structure containing both stub and stem is written into the lexc files to avoid duplicating all the lexicons.
Particle / Noun class / Adjective class / pronoun / numeral class / verbal arguments
Determines additional analyses for words of a given POS. These are used to inject additional lexical data into analyses, e.g. particle -> adposition, preposition, genitive complement; numeral -> ordinal, roman, digit; verb -> transitive with elative argument.
Sem
Determines optional semantic classes.
Origin
Denotes the data source for the lexeme. One of {omorfi, kotus, joukahainen, finnwordnet, fiwikt, omegawiki, unihu}