Lexico-Morphology

TAN-LM files are used to associate words or word fragments with lexemes and morphological categories. They are intended primarily to facilitate research that depends upon alignments, but they can be valuable on their own, whether or not there are other versions or alignments.

These files rely upon the grammatical rules declared for a given language, and those rules must be defined in a TAN-mor file. This section should therefore be read in close conjunction with its companion, the section called “Morphological Concepts and Patterns (TAN-mor)”.

TAN-LM files are assumed to be applicable to texts in languages whose vocabulary lends itself to grammatical and lexicographical analysis. The two areas are interrelated but independent. If you wish, your TAN-LM file may contain only lexemes or only morphological analyses.

As an editor of a TAN-LM file you should understand the vocabulary and grammar of the languages you have picked. You should have a good sense of the rules established by the lexical and grammatical authorities you have chosen to follow. You should be familiar with the TAN-mor files you have chosen.

Although you must assume the point of view of a particular grammar and lexicon, you need not define those authorities, nor hold to a single one. In addition, you may bring to lexical analysis your own expertise and supply lexical headwords unattested in printed authorities.

Although TAN-LM files are simple, they can be laborious to read and write, more so than other types of TAN files. It is customary for an editor of a TAN-LM file to use tools to help create and edit the data.

The root element of a lexico-morphological file is TAN-LM.

A mandatory <source> element points to the one and only TAN-T(EI) file that is the object of analysis.

<declarations> takes the elements common to class 2 files (see the section called “Class 2 Metadata (<head>)”). It takes two other elements unique to TAN-LM: <lexicon> (optional) and <morphology> (mandatory). Any number of lexica and morphologies may be declared, without regard for order.

There is, at present, no TAN format for lexica and dictionaries, although this may change in the future. So even if a digital form of a dictionary is identified through the section called “Digital Entity Metadata Pattern”, no validation tests will be performed.

You may find that one of the existing lexical models is a suitable supplement to any TAN collections you develop. The TEI supports dictionary encoding, and the Lexical Markup Framework, an ISO standard (ISO-24613:2008), has defined a data model for lexicons and dictionaries. The former is geared toward philology and the latter toward linguistics. You may also devise your own format if neither of these supports aspects of lexicology that you find important.

Because you or other TAN-LM editors are likely to be authorities in your own right, an <agent> can be treated as if it were a <lexicon>, and be referred to by @lexicon in the <body>.

The <body> of a TAN-LM file takes, in addition to the customary optional attributes found in other TAN files (see @in-progress and the section called “Edit Stamp”), @lexicon and @morphology, to specify the default lexicon and grammar for the file. @lexicon may point either to a <lexicon> id or to an <agent> id (particularly useful for languages without good published lexica).

<body> has only one type of child: one or more <ana>s (short for analysis), each of which matches one or more tokens (<tok>) to one or more lexemes or morphological assertions (<lm>, which takes <l>s and <m>s).

If, because of tokenization, a linguistic token must span more than one <tok>, you may use @cont in the first. It will be concatenated with the next one and treated as a single linguistic token.
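The concatenation behavior of @cont can be illustrated with a minimal Python sketch. It is not part of TAN or its validation routines. The function name `join_cont` and the representation of each <tok> as a `(text, cont)` pair are assumptions made for illustration, as is the choice to let a run of several consecutive @cont tokens chain together.

```python
def join_cont(toks):
    """Merge token fragments into linguistic tokens.

    toks is a list of (text, cont) pairs, where cont is True when the
    fragment carries @cont and must be joined to the fragment after it.
    """
    joined = []
    buffer = ""
    for text, cont in toks:
        buffer += text
        if not cont:
            # This fragment closes the current linguistic token.
            joined.append(buffer)
            buffer = ""
    if buffer:
        # A trailing fragment marked @cont has no follower; keep it as is.
        joined.append(buffer)
    return joined


# Hypothetical input: "can" carries @cont, so it is joined to "not".
merged = join_cont([("can", True), ("not", False), ("go", False)])
```

Here `merged` holds two linguistic tokens, "cannot" and "go", even though the input held three fragments.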

The <tok> is assumed to refer to as many tokens as there are @refs. If @pos or @chars has multiple values, these are to be treated as constituent parts of a single token found in each @ref.

Within an <ana>, claims are distributed. That is, every combination of <l> and <m> (governed by <lm>) is asserted of every <tok>.
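The principle of distribution amounts to a Cartesian product, which the following Python sketch tries to make concrete. The dictionary layout, the function name `distribute`, and the plain-English part-of-speech labels are assumptions for illustration only; real <m> values would be codes defined by a TAN-mor file. An <lm> that lacks lexemes or lacks morphological values is handled by pairing with `None`.

```python
from itertools import product


def distribute(ana):
    """Expand one <ana> into individual (tok, lexeme, morph) claims."""
    claims = []
    for lm in ana["lms"]:
        # An <lm> may hold only lexemes or only morphological values.
        lexemes = lm.get("l") or [None]
        morphs = lm.get("m") or [None]
        # Every <l>/<m> combination is asserted of every <tok>.
        for tok, lexeme, morph in product(ana["toks"], lexemes, morphs):
            claims.append((tok, lexeme, morph))
    return claims


# Hypothetical <ana>: one token, one lexeme, four morphological options.
ana = {
    "toks": ["down"],
    "lms": [{"l": ["down"], "m": ["adjective", "adverb", "noun", "verb"]}],
}
claims = distribute(ana)
```

With one token, one lexeme, and four morphological values, `claims` holds four assertions. Adding a second token to the same <ana> would double that number.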

The principle of distribution allows, for example, an algorithm to economically populate a TAN-LM file with all the lexico-morphological possibilities for a given word (for example, marking "down" as an adjective, an adverb, a noun, and a verb). But this freedom poses an interesting problem. When lexico-morphological options are generated automatically, one does not wish to claim that a word really is every combination generated. Yet there are times when such ambiguity must be expressed (e.g., "down" in "Get down off a duck.").

Because multiple results from the automatic generation of lexico-morphological options are usually in doubt, it is advised that such results always include @cert. If it is likely that a given word has, in fact, only one lexico-morphological entry, but four options (two lexemes, each with two morphological codes) have been returned, then the algorithm should give each <l> and each <m> a @cert with the value 0.5.
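One way an algorithm might arrive at those values is to split certainty evenly among the competing alternatives at each level. The Python sketch below assumes that uniform split. The function name `uniform_certs`, the dictionary layout, and the placeholder lexeme and code names are all hypothetical, and a real tool might weight the options differently.

```python
def uniform_certs(options):
    """Assign an even share of certainty to each lexeme and each code.

    options maps each candidate lexeme to its candidate morphological codes.
    """
    l_cert = 1 / len(options)
    entries = []
    for lexeme, codes in options.items():
        # Codes compete only with the other codes under the same lexeme.
        m_cert = 1 / len(codes) if codes else None
        entries.append({
            "l": lexeme,
            "cert": l_cert,
            "m": [{"code": code, "cert": m_cert} for code in codes],
        })
    return entries


# Placeholder data matching the case above: two lexemes, two codes each.
entries = uniform_certs({
    "lexeme-a": ["code-1", "code-2"],
    "lexeme-b": ["code-3", "code-4"],
})
```

Each lexeme and each code in `entries` receives a certainty of 0.5, matching the case described above.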