Lexico-Morphology

TAN-A-lm files are used to associate words or word fragments with lexemes and morphological categories.

These files have two kinds of dependencies: a class 1 source (optional) and the grammatical rules defined in one or more TAN-mor files. This section should therefore be read in close conjunction with its companion, the section called “Morphological Concepts and Patterns (TAN-mor)”.

Principles and Assumptions

Editors of TAN-A-lm files should understand the vocabulary and grammar of the chosen languages. They should have a good sense of the rules established by the lexical and grammatical authorities adopted, and they should be familiar with the conventions and assumptions of the TAN-mor files they rely on.

Although you must assume the point of view of a particular grammar and lexicon, you need not hold to a single one. In addition, you may bring to lexical analysis your own expertise and supply lexical headwords unattested in printed authorities.

Although TAN-A-lm files are simple, they can be laborious to write and edit, more so than other types of TAN files. They can also be hard to read if the underlying TAN-mor files use cryptic codes. Editors of TAN-A-lm files customarily use tools to help create and edit the data.

Root Element and Header

The root element of a lexico-morphological file is TAN-A-lm.

TAN-A-lm files are either source-specific or language-specific. In the case of the former, <source> points to the one and only TAN-T(EI) file that is the object of analysis. In the case of the latter, <for-lang> is used to indicate the languages that are covered.
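
The distinction can be checked mechanically. The following Python sketch is only an illustration: it assumes a simplified, namespace-free header fragment in which <for-lang> carries a language code as text. The function name lm_scope and the sample fragments are hypothetical, and treating any other combination of elements as an error is this sketch's own assumption, not a rule quoted from the schema.

```python
import xml.etree.ElementTree as ET


def lm_scope(head_xml):
    """Classify a simplified header fragment as source- or language-specific."""
    head = ET.fromstring(head_xml)
    sources = head.findall("source")
    # Assumption: each <for-lang> holds one language code as its text content.
    langs = [el.text.strip() for el in head.findall("for-lang") if el.text]
    if len(sources) == 1 and not langs:
        return ("source-specific", [])
    if langs and not sources:
        return ("language-specific", langs)
    raise ValueError("expected exactly one <source> or at least one <for-lang>")


# Hypothetical fragments, stripped of namespaces and all other metadata.
SOURCE_HEAD = "<head><source/></head>"
LANG_HEAD = "<head><for-lang>grc</for-lang><for-lang>lat</for-lang></head>"

print(lm_scope(SOURCE_HEAD))
print(lm_scope(LANG_HEAD))
```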

<definitions> takes the elements common to class 2 files (see the section called “Class 2 Metadata (<head>)”). It also takes two elements unique to TAN-A-lm: <lexicon> (optional) and <morphology> (mandatory). Any number of lexica and morphologies may be declared; the order is inconsequential.

There is, at present, no TAN format for lexica and dictionaries, although this may change in the future. So even if a digital form of a dictionary is identified through the section called “Digital Entity Metadata Pattern”, validation tests do not take this element into account.

Because you or other TAN-A-lm editors are likely to be authorities in your own right, a <person> can be treated as if it were a <lexicon>, and can be referred to by @lexicon in the <body>.

Data (<body>)

The <body> of a TAN-A-lm file takes, in addition to the customary optional attributes found in other TAN files (see @in-progress and the section called “Edit Stamp”), @lexicon and @morphology, to specify the default lexicon and grammar.

<body> has only one type of child: one or more <ana>s (short for analysis), each of which matches one or more tokens (<tok>) to one or more lexemes or morphological assertions (<lm>, which takes <l>s and <m>s).

If, because of tokenization, a single linguistic token spans more than one <tok>, you may use <group> to group those <tok>s together.

Elements within an <ana> are distributed. That is, every combination of <l> and <m> (governed by <lm>) is asserted to be true for every <tok>.
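
To make the distribution concrete, here is a minimal Python sketch that expands one <ana> into the individual assertions it implies. It assumes a simplified, namespace-free fragment in which each <tok> names its token through a val attribute. The token values, lexemes, and morphological codes are invented for illustration, and expand_ana is a hypothetical helper, not part of any TAN tool.

```python
import xml.etree.ElementTree as ET
from itertools import product

# Hypothetical <ana>: two tokens, two <lm>s; the codes are made up.
ANA_XML = """
<ana>
  <tok val="leaves"/>
  <tok val="Leaves"/>
  <lm>
    <l>leaf</l>
    <m>noun pl</m>
  </lm>
  <lm>
    <l>leave</l>
    <m>verb 3 sg pres</m>
    <m>noun pl</m>
  </lm>
</ana>
"""


def expand_ana(ana_xml):
    """Return every (token, lexeme, morphology) triple implied by one <ana>."""
    ana = ET.fromstring(ana_xml)
    tokens = [tok.get("val") for tok in ana.findall("tok")]
    triples = []
    for lm in ana.findall("lm"):
        lexemes = [l.text.strip() for l in lm.findall("l")]
        codes = [m.text.strip() for m in lm.findall("m")]
        # <l> and <m> combine only within their own <lm>;
        # each resulting pair is then asserted of every <tok>.
        triples.extend(product(tokens, lexemes, codes))
    return triples


ASSERTIONS = expand_ana(ANA_XML)
for triple in ASSERTIONS:
    print(triple)
```

In this sample the first <lm> contributes one lexeme-morphology pair and the second contributes two, so the two tokens yield six assertions in all. Note that "leaf" is never paired with the verbal code, because combinations do not cross <lm> boundaries.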

Many TAN-A-lm files will be populated by a stylesheet or other algorithm that automatically lists all possible morphological values of each token. Such automatically calculated results should always include @cert with weighted values.
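
One way an algorithm might derive such weights is sketched below in Python. Everything specific here is an assumption made for illustration: the weights are relative frequencies of each candidate in some reference corpus, expressed as decimals between 0 and 1, and @cert is placed on <lm>. The helper weighted_lms, the sample counts, and the codes are hypothetical; consult the schema for where @cert is permitted and what values it accepts.

```python
import xml.etree.ElementTree as ET


def weighted_lms(candidates):
    """Build <lm> elements whose cert is each candidate's share of the total.

    candidates: list of (lexeme, morphological code, observed count).
    """
    total = sum(count for _, _, count in candidates)
    if total <= 0:
        raise ValueError("at least one candidate needs a positive count")
    lms = []
    for lexeme, code, count in candidates:
        # Assumption: cert is a decimal from 0 to 1, rounded to two places.
        lm = ET.Element("lm", {"cert": f"{count / total:.2f}"})
        ET.SubElement(lm, "l").text = lexeme
        ET.SubElement(lm, "m").text = code
        lms.append(lm)
    return lms


# Hypothetical counts for two competing analyses of one token.
LMS = weighted_lms([("leaf", "noun pl", 3), ("leave", "verb 3 sg pres", 1)])
for lm in LMS:
    print(ET.tostring(lm, encoding="unicode"))
```

A human editor reviewing the file can then raise, lower, or remove the weights as each candidate is confirmed or rejected.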
