<TAN-A-lm>
TAN-A-lm files are used to annotate the lexical and morphological character of individual tokens or morphemes.
These files have two kinds of dependencies: a class-1 source (optional) and the
grammatical rules defined in one or more TAN-mor files. This section should therefore
be read in close conjunction with the section called “Morphological Concepts and Patterns (TAN-mor)”.
TAN-A-lm files are either source-specific or language-specific. Source-specific TAN-A-lm files depend exclusively upon one class-1 source. Language-specific TAN-A-lm files depend upon an unknown number of sources: some might be based upon a small, specific corpus, others upon a vast, general one. Source-specific TAN-A-lm files are useful for closely analyzing one particular text; language-specific ones are useful for building language resources for computer applications.
Editors of TAN-A-lm files should understand the vocabulary and grammar of the languages of their sources. They should have a good sense of the rules established by the lexical and grammatical authorities adopted. They should be familiar with the conventions and assumptions of the TAN-mor files being used.
Although you must assume the point of view of a particular grammar and lexicon, you need not hold to a single one. In addition, you may bring to the analysis your own expertise and supply lexical headwords unattested in published authorities.
Although TAN-A-lm files are simple, they can be laborious to write and edit, more than any other type of TAN file. They can also be hard to read if the morphological codes are cryptic. It is customary for an editor of a TAN-A-lm file to use tools to create and edit the data.
The root element of a lexico-morphological file is TAN-A-lm.
If the file is source-specific, <source>
points to the one and only TAN-T(EI) file that is
the object of analysis. If the file is language-specific, <for-lang>
is used in the
declarations section of the <head>
to indicate the languages that are covered. For
language-specific TAN-A-lm files, this part of the <head>
may also include <tok-starts-with>
and
<tok-is>
, which
improve performance when validating and processing numerous or large files.
There is at present no mechanism for automatically reconstructing the corpus that underlies a language-specific TAN-A-lm file. Such a mechanism may be provided in a future version of TAN.
<vocabulary-key>
takes the elements other class-2 files take (see the section called “Class 2 Metadata (<head>)”). It also permits two elements unique to TAN-A-lm: <lexicon>
(optional) and
<morphology>
(mandatory). Any number of lexica and morphologies may be declared; the order is
inconsequential.
There is, at present, no TAN format for lexica and dictionaries. So even if a digital form of a dictionary is identified through the section called “Digital Entity Metadata Pattern”, the Schematron validation routine will not attempt to check the TAN-A-lm data against the lexical authorities cited.
Because you or other TAN-A-lm editors are likely to be authorities in your own
right, a <person>
can be
treated as if it were a <lexicon>,
and be referred to by @lexicon
.
<body>
The <body>
of a TAN-A-lm
file takes, in addition to the customary optional attributes found in other TAN
files (see the section called “Edit Stamp”), @lexicon
and @morphology
, to specify the default lexicon and
grammar.
<body>
has only one type of
child: one or more <ana>
s
(short for analysis), each of which matches one or more tokens (<tok>
) to one or more lexemes or
morphological assertions (<lm>
, which takes <l>
s and <m>
s).
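Putting these pieces together, a minimal <body> might look like the following sketch. The reference, token value, and morphological code are illustrative assumptions, as is @val (matching a token by its text value); real codes depend entirely on the TAN-mor file in force.

```xml
<body lexicon="lex1" morphology="morph1">
   <!-- one token matched to one lexeme (<l>) and one morphological code (<m>) -->
   <ana>
      <tok ref="1 1" val="gloria"/>
      <lm>
         <l>gloria</l>
         <m>noun fem sing nom</m> <!-- code as defined by the governing TAN-mor file -->
      </lm>
   </ana>
</body>
```

Here "lex1" and "morph1" are assumed to be idrefs pointing to a <lexicon> and a <morphology> declared in the <vocabulary-key>.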
An <ana>
may take a
@tok-pop
, to specify
the number of tokens that the assertion applies to. This is particularly helpful
for language-specific files based upon a limited corpus of texts, where the
underlying data for the assertion might be difficult or impossible to retrieve.
The token population can be used to assign levels of certainty, or to compare
statistical profiles of one TAN-A-lm file against another.
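For example (all values hypothetical), an <ana> drawn from a corpus in which a token value was observed fourteen times might record both the population and the weighted alternatives:

```xml
<!-- hypothetical: the token value occurred 14 times in the underlying corpus -->
<ana tok-pop="14">
   <tok val="canis"/> <!-- @val, matching by token value, is an assumption -->
   <lm cert="0.86"><l>canis</l><m>noun masc sing nom</m></lm> <!-- 12 of 14 -->
   <lm cert="0.14"><l>canis</l><m>noun masc sing gen</m></lm> <!-- 2 of 14 -->
</ana>
```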
If you wish to make a claim about a single word that straddles more than one token,
you should use multiple <tok>
s,
wrapping them in a <group>
.
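A sketch of that pattern, with hypothetical values (here a word split across two tokens, perhaps by intervening punctuation or a line break):

```xml
<ana>
   <group>
      <!-- together the two <tok>s constitute a single word -->
      <tok ref="1 2" val="tan"/>
      <tok ref="1 3" val="tum"/>
   </group>
   <lm>
      <l>tantus</l>
      <m>adj masc sing acc</m> <!-- code depends on the TAN-mor file -->
   </lm>
</ana>
```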
Any token may be the object of as many <ana>
s as you like. In fact, this is preferred if you
wish to register competing claims or alternatives.
Claims within an <ana>
are distributed. That is, every combination of <l>
and <m>
(governed by <lm>
) is asserted to be true for every <tok>
or <group>
.
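That is, an <ana> like the following hypothetical sketch makes four distinct claims: the one <l> is paired with each of the two <m>s, and each resulting pair is asserted of each <tok>:

```xml
<!-- refs, values, and codes are illustrative -->
<ana>
   <tok ref="1 4" val="legit"/>
   <tok ref="2 9" val="legit"/>
   <lm>
      <l>lego</l>
      <m>verb 3 sing pres act</m>
      <m>verb 3 sing perf act</m>
   </lm>
</ana>
<!-- four claims: each token is asserted to be lego-as-present and lego-as-perfect -->
```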
Many TAN-A-lm files will be generated by an algorithm that automatically lists
all possible morphological values of each token. It is advised that such automatic
calculations always include in their output @cert
, with weighted values. That is, if an algorithm
identifies two possible lexico-morphological profiles for a word, but one occurs
nine times more than the other, then it is advised that this be reflected in the
two resultant elements, e.g.: <lm cert="0.9">...</lm>
and
<lm cert="0.1">...</lm>
. If an algorithm is written with a
more sophisticated way to weigh possibilities, then adjust the value of
@cert
accordingly. Be
certain that the <algorithm>
is credited in the <vocabulary-key>
and in a
<resp>
.
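Expanded into a complete <ana>, the inline example above might look like this (the lexemes and codes are hypothetical):

```xml
<ana>
   <tok val="amor"/> <!-- @val, matching by token value, is an assumption -->
   <lm cert="0.9">
      <l>amor</l>
      <m>noun masc sing nom</m> <!-- observed nine times as often -->
   </lm>
   <lm cert="0.1">
      <l>amo</l>
      <m>verb 1 sing pres pass</m>
   </lm>
</ana>
```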
As with TAN-A-tok files, not every word needs to be explained or described. In fact, exhaustive coverage is oftentimes undesirable, since it produces files that are overly long and time-consuming to validate or process.
A TAN-A-lm file is rendered more efficient when claims can be grouped. If a
particular token always has a particular lexico-morphological profile, this can be
declared once, in a <tok>
that
does not have @ref
, or it can be
specified through a compound @ref
. You do not need to provide a <tok>
for every leaf div. In fact,
such an approach can result in inefficient validation and processing.
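A sketch of such a document-wide claim, assuming @val matches a token by its text value:

```xml
<!-- no @ref: applies to every token in the source whose value is "et" -->
<ana>
   <tok val="et"/>
   <lm>
      <l>et</l>
      <m>conjunction</m> <!-- code depends on the TAN-mor file -->
   </lm>
</ana>
```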
For example, in early versions of TAN, the lexico-morphological values of the
Greek Septuagint (8.3 MB) were converted to a TAN-A-lm file of 407,811 <tok>
s grouped in 52,703 <ana>
s (25.8 MB). Early 2020
validation routines took about 25 minutes (2018 validation routines took hours).
That particular TAN-A-lm file itemized every single token in the text. It was
revised to be more declarative along the lines advocated above. If a particular
token had only one lexico-morphological profile throughout the text, then every
instance was reduced to a single <ana>
, with no @ref
in <tok>
. When a particular token value had different
lexico-morphological profiles, @ref
targeted the rootmost <div>
. This revision resulted in a smaller file (15.8 MB;
158,376 <tok>
s in 54,335
<ana>
s) that validated
in about a third of the time (8.5 minutes).
In general, there is always a trade-off between convenience and efficiency. If
your priority is speed, you should break a large file into several smaller ones,
perhaps recombining them in a master file via <inclusion>
(see the section called “Networked Files”).