Lexico-Morphology

TAN-LM files are used to associate words or word fragments with lexemes and morphological categories. They are intended primarily to facilitate research that depends upon alignments, but they can be valuable on their own, whether or not there are other versions or alignments.

These files rely upon the grammatical rules defined for a given language in a TAN-mor file. Therefore this section should be read in close conjunction with its companion, the section called “Morphological Concepts and Patterns (TAN-mor)”.

TAN-LM files are assumed to be applicable to texts in languages whose vocabulary lends itself to grammatical and lexicographical analysis. The two areas are interrelated but independent. If you wish, your TAN-LM file may contain only lexemes or only morphological analyses.

As an editor of a TAN-LM file you should understand the vocabulary and grammar of the languages you have picked. You should have a good sense of the rules established by the lexical and grammatical authorities you have chosen to follow. You should be familiar with the conventions and assumptions of the TAN-mor files you have adopted.

Although you must assume the point of view of a particular grammar and lexicon, you need not define those authorities, nor hold to a single one. In addition, you may bring to lexical analysis your own expertise and supply lexical headwords unattested in printed authorities.

Although TAN-LM files are simple, they can be laborious to write and edit, more so than other types of TAN files. They can also be hard to read if the underlying TAN-mor files use cryptic codes. It is customary for an editor of a TAN-LM file to use tools to help create and edit the data.

The root element of a lexico-morphological file is TAN-LM.

TAN-LM files are either source-specific or language-specific. In the case of the former, <source> points to the one and only TAN-T(EI) file that is the object of analysis. In the case of the latter, <for-lang> is used to indicate the languages that are covered.

Note

If the language-specific option is exercised, the file must point to TAN-LM-lang schema files. See the section called “Overall Structure (root)”.

<declarations> takes the elements common to class 2 files (see the section called “Class 2 Metadata (<head>)”). It also takes two elements unique to TAN-LM: <lexicon> (optional) and <morphology> (mandatory). Any number of lexica and morphologies may be declared; the order is inconsequential.

There is, at present, no TAN format for lexica and dictionaries, although this may change in the future. So even if a digital form of a dictionary is identified through the section called “Digital Entity Metadata Pattern”, no validation tests will be performed.

You may find a non-TAN lexical model to be a suitable supplement to any TAN collections you develop. The TEI supports dictionary encoding, and the Lexical Markup Framework, an ISO standard (ISO 24613:2008), has defined a data model for lexicons and dictionaries. The former is geared toward philology and the latter toward linguistics. You may also devise your own format if neither of these supports aspects of lexicology that you find important.

Because you or other TAN-LM editors are likely to be authorities in your own right, an <agent> can be treated as if it were a <lexicon>, and be referred to by @lexicon in the <body>.

The <body> of a TAN-LM file takes, in addition to the customary optional attributes found in other TAN files (see @in-progress and the section called “Edit Stamp”), @lexicon and @morphology, to specify the default lexicon and grammar for the file. @lexicon may point either to a <lexicon> id or to an <agent> id (when someone editing the TAN file is an authority).

<body> has only one type of child: one or more <ana>s (short for analysis), each of which matches one or more tokens (<tok>) to one or more lexemes or morphological assertions (<lm>, which takes <l>s and <m>s).

If due to tokenization a linguistic token must occupy more than one <tok>, you may use @cont to group <tok>s together.

Elements within an <ana> are distributed, to allow economically sized files. That is, every combination of <l> and <m> (governed by <lm>) is asserted to be true for every <tok>.
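The distribution rule can be sketched outside of XML. The following Python fragment is only an illustration of the logic, not part of TAN: the function name expand_ana, the token labels, and the tuple layout are assumptions made for this example. It treats one <ana> as a list of tokens plus a list of <lm> groups, and yields every token, lexeme, and morphology combination that the analysis asserts.

```python
from itertools import product


def expand_ana(toks, lms):
    """Expand one analysis into individual (token, lexeme, morphology) claims.

    toks: list of token labels (stand-ins for <tok> elements).
    lms:  list of (lexemes, morphs) pairs, one pair per <lm> group.
    An <lm> with no lexemes or no morphs is allowed; the missing part
    is represented by None rather than suppressing the claim.
    """
    claims = []
    for lexemes, morphs in lms:
        # Every <l> combines with every <m>, and the result applies to every <tok>.
        claims.extend(product(toks, lexemes or [None], morphs or [None]))
    return claims


# Hypothetical data: two tokens sharing one <lm> with one lexeme and two codes.
claims = expand_ana(
    toks=["tok-1", "tok-2"],
    lms=[(["down"], ["adverb", "noun"])],
)
for claim in claims:
    print(claim)
```

Two tokens, one lexeme, and two morphological codes produce four claims, which is why a compact <ana> can stand for many individual assertions.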

Many TAN-LM files will be populated by a stylesheet or other algorithm that automatically calculates the possible morphological values of each token, for example, "down" being marked as an adjective, an adverb, a noun, and a verb. In this case, you do not wish to claim that a word really is every combination generated. But you do wish to leave open the possibility for cases where such ambiguity must be expressed (e.g., "down" in "Get down off a duck." being equally a noun and an adverb). It is advised that automatically calculated results always include @cert with weighted values that sum to 1 for each token.
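One way an algorithm might arrive at such weights is to normalize whatever raw scores it has for each candidate. The sketch below assumes hypothetical scores (for instance, corpus frequencies) and a hypothetical function name, assign_cert; it shows only the arithmetic, not how the values are written into @cert.

```python
def assign_cert(candidates):
    """Scale raw scores so the resulting certainty values sum to 1.

    candidates: list of (label, raw_score) pairs for a single token.
    Returns a list of (label, cert) pairs in the same order.
    """
    total = sum(score for _, score in candidates)
    if total <= 0:
        raise ValueError("at least one candidate needs a positive score")
    return [(label, score / total) for label, score in candidates]


# Hypothetical raw scores for the four automatically generated readings of "down".
certs = assign_cert([
    ("adjective", 1),
    ("adverb", 5),
    ("noun", 2),
    ("verb", 2),
])
for label, cert in certs:
    print(label, cert)
```

If the algorithm has no basis for preferring one reading, passing equal scores gives each candidate the same share.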