<TAN-A-lm>
TAN-A-lm files are used to annotate the lexical and morphological character of individual tokens or morphemes.
These files have two kinds of dependencies: a class-1 source (optional) and the
grammatical rules defined in one or more TAN-mor files. This section should therefore
be read in close conjunction with the section called “Morphological Concepts and Patterns (TAN-mor)”.
TAN-A-lm files are either source-specific or language-specific. Source-specific TAN-A-lm files depend exclusively upon one class-1 source. Language-specific TAN-A-lm files depend upon an unknown number of sources: some might be based upon a small, specific corpus, others upon a vast, general one. Source-specific TAN-A-lm files are useful for closely analyzing one particular text; language-specific ones are useful for building language resources for computer applications.
Editors of TAN-A-lm files should understand the vocabulary and grammar of the languages of their sources. They should have a good sense of the rules established by the lexical and grammatical authorities adopted. They should be familiar with the conventions and assumptions of the TAN-mor files being used.
Although you must assume the point of view of a particular grammar and lexicon, you need not hold to a single one. In addition, you may bring to the analysis your own expertise and supply lexical headwords unattested in published authorities.
Although TAN-A-lm files are simple, they can be laborious to write and edit, more than any other type of TAN file. They can also be hard to read if the morphological codes are cryptic. It is customary for an editor of a TAN-A-lm file to use tools to create and edit the data.
The root element of a lexico-morphological file is TAN-A-lm.
If the file is source-specific, <source>
points to the one and only TAN-T(EI) file that is
the object of analysis. If the file is language-specific, <for-lang>
is used in the
declarations section of the <head>
to indicate the languages that are covered. For
language-specific TAN-A-lm files, this part of the <head>
may also include <tok-starts-with>
and
<tok-is>
, which
improve performance when validating and processing numerous or large files.
There is at present no mechanism for automatically reconstructing the corpus that underlies a language-specific TAN-A-lm file. Such a mechanism may be provided in a future version of TAN.
<vocabulary-key>
takes the elements other class-2 files take (see the section called “Class 2 Metadata (<head>)”). It also permits two elements unique to TAN-A-lm: <lexicon>
(optional) and
<morphology>
(mandatory). Any number of lexica and morphologies may be declared; the order is
inconsequential.
There is, at present, no TAN format for lexica and dictionaries. So even if a digital form of a dictionary is identified through the section called “Digital Entity Metadata Pattern”, the Schematron validation routine will not attempt to check the TAN-A-lm data against the lexical authorities cited.
Because you or other TAN-A-lm editors are likely to be authorities in your own
right, a <person>
can be
treated as if it were a <lexicon>,
and be referred to by @lexicon
.
<body>
The <body>
of a TAN-A-lm
file takes, in addition to the customary optional attributes found in other TAN
files (see the section called “Edit Stamp”), @lexicon
and @morphology
, to specify the default lexicon and
grammar.
<body>
has only one type of
child: one or more <ana>
s
(short for analysis), each of which matches one or more tokens (<tok>
) to one or more lexemes or
morphological assertions (<lm>
, which takes <l>
s and <m>
s).
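Putting these pieces together, a minimal <body> might look like the following sketch. The reference, token value, and morphological code are illustrative assumptions, as is @val (matching a token by its text value); real codes depend entirely on the TAN-mor file in force.

```xml
<body lexicon="lex1" morphology="morph1">
   <!-- one token matched to one lexeme (<l>) and one morphological code (<m>) -->
   <ana>
      <tok ref="1 1" val="gloria"/>
      <lm>
         <l>gloria</l>
         <m>noun fem sing nom</m> <!-- code as defined by the governing TAN-mor file -->
      </lm>
   </ana>
</body>
```

Here "lex1" and "morph1" are assumed to be idrefs pointing to a <lexicon> and a <morphology> declared in the <vocabulary-key>.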
An <ana>
may take a
@tok-pop
, to specify
the number of tokens that the assertion applies to. This is particularly helpful
for language-specific files based upon a limited corpus of texts, where the
underlying data for the assertion might be difficult or impossible to retrieve.
The token population can be used to assign levels of certainty, or to compare
statistical profiles of one TAN-A-lm file against another.
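For example (all values hypothetical), an <ana> drawn from a corpus in which a token value was observed fourteen times might record both the population and the weighted alternatives:

```xml
<!-- hypothetical: the token value occurred 14 times in the underlying corpus -->
<ana tok-pop="14">
   <tok val="canis"/> <!-- @val, matching by token value, is an assumption -->
   <lm cert="0.86"><l>canis</l><m>noun masc sing nom</m></lm> <!-- 12 of 14 -->
   <lm cert="0.14"><l>canis</l><m>noun masc sing gen</m></lm> <!-- 2 of 14 -->
</ana>
```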
If you wish to make a claim about a single word that straddles more than one token,
you should use multiple <tok>
s,
wrapping them in a <group>
.
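A sketch of that pattern, with hypothetical values (here a word split across two tokens, perhaps by intervening punctuation or a line break):

```xml
<ana>
   <group>
      <!-- together the two <tok>s constitute a single word -->
      <tok ref="1 2" val="tan"/>
      <tok ref="1 3" val="tum"/>
   </group>
   <lm>
      <l>tantus</l>
      <m>adj masc sing acc</m> <!-- code depends on the TAN-mor file -->
   </lm>
</ana>
```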
Any token may be the object of as many <ana>
s as you like. In fact, this is preferred if you
wish to register competing claims or alternatives.
Claims within an <ana>
are distributed. That is, every combination of <l>
and <m>
(governed by <lm>
) is asserted to be true for every <tok>
or <group>
.
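That is, an <ana> like the following hypothetical sketch makes four distinct claims: the one <l> is paired with each of the two <m>s, and each resulting pair is asserted of each <tok>:

```xml
<!-- refs, values, and codes are illustrative -->
<ana>
   <tok ref="1 4" val="legit"/>
   <tok ref="2 9" val="legit"/>
   <lm>
      <l>lego</l>
      <m>verb 3 sing pres act</m>
      <m>verb 3 sing perf act</m>
   </lm>
</ana>
<!-- four claims: each token is asserted to be lego-as-present and lego-as-perfect -->
```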
Many TAN-A-lm files will be generated by an algorithm that automatically lists
all possible morphological values of each token. It is advised that such automatic
calculations always include in their output @cert
, with weighted values. That is, if an algorithm
identifies two possible lexico-morphological profiles for a word, but one occurs
nine times more than the other, then it is advised that this be reflected in the
two resultant elements, e.g.: <lm cert="0.9">...</lm>
and
<lm cert="0.1">...</lm>
. If an algorithm is written with a
more sophisticated way to weigh possibilities, then adjust the value of
@cert
accordingly. Be
certain that the <algorithm>
is credited in the <vocabulary-key>
and in a
<resp>
.
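Expanded into a complete <ana>, the inline example above might look like this (the lexemes and codes are hypothetical):

```xml
<ana>
   <tok val="amor"/> <!-- @val, matching by token value, is an assumption -->
   <lm cert="0.9">
      <l>amor</l>
      <m>noun masc sing nom</m> <!-- observed nine times as often -->
   </lm>
   <lm cert="0.1">
      <l>amo</l>
      <m>verb 1 sing pres pass</m>
   </lm>
</ana>
```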
As with TAN-A-tok files, not every word needs to be explained or described. In fact, exhaustive coverage is oftentimes undesirable, since it produces files that are overly long and time-consuming to validate or process.
A TAN-A-lm file is rendered more efficient when claims can be grouped. If a
particular token always has a particular lexico-morphological profile, this can be
declared once, in a <tok>
that
does not have @ref
, or it can be
specified through a compound @ref
. You do not need to provide a <tok>
for every leaf div. In fact,
such an approach can result in inefficient validation and processing.
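A sketch of such a document-wide claim, assuming @val matches a token by its text value:

```xml
<!-- no @ref: applies to every token in the source whose value is "et" -->
<ana>
   <tok val="et"/>
   <lm>
      <l>et</l>
      <m>conjunction</m> <!-- code depends on the TAN-mor file -->
   </lm>
</ana>
```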
For example, in early versions of TAN, the lexico-morphological values of the
Greek Septuagint (8.3 MB) were converted to a TAN-A-lm file of 407,811 <tok>
s grouped in 52,703 <ana>
s (25.8 MB). Early 2020
validation routines took about 25 minutes (2018 validation routines took hours).
That particular TAN-A-lm file itemized every single token in the text. It was
revised to be more declarative along the lines advocated above. If a particular
token had only one lexico-morphological profile throughout the text, then every
instance was reduced to a single <ana>
, with no @ref
in <tok>
. When a particular token value had different
lexico-morphological profiles, @ref
targeted the rootmost <div>
. This revision resulted in a smaller file (15.8 MB;
158,376 <tok>
s in 54,335
<ana>
s) that validated
in about a third of the time (8.5 minutes).
In general, there is always a trade-off between convenience and efficiency. If
your priority is speed, you should break a large file into several smaller ones,
perhaps recombining them in a master file via <inclusion>
(see the section called “Networked Files”).