TAN-A-lm files are used to associate words or word fragments with lexemes and morphological categories.
These files have two kinds of dependencies: a class 1 source (optional) and the grammatical rules defined in one or more TAN-mor files. This section should therefore be read in close conjunction with its companion, the section called “Morphological Concepts and Patterns (TAN-mor)”.
Editors of TAN-A-lm files should understand the vocabulary and grammar of the chosen languages. They should have a good sense of the rules established by the lexical and grammatical authorities they have adopted, and they should be familiar with the conventions and assumptions of the TAN-mor files they rely on.
Although you must assume the point of view of a particular grammar and lexicon, you need not hold to a single one. In addition, you may bring to lexical analysis your own expertise and supply lexical headwords unattested in printed authorities.
Although TAN-A-lm files are simple, they can be laborious to write and edit, more than other types of TAN files. They can also be hard to read if the underlying TAN-mor files use cryptic codes. It is customary for an editor of a TAN-A-lm file to use tools to help create and edit the data.
The root element of a lexico-morphological file is TAN-A-lm.
TAN-A-lm files are either source-specific or language-specific. In the former case, <source> points to the one and only TAN-T(EI) file that is the object of analysis. In the latter case, <for-lang> indicates the languages that are covered.
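As a rough illustration of the distinction above, the following Python sketch classifies a file by looking for a <source> or a <for-lang> element. The sample fragments are hypothetical and heavily abbreviated (no namespace, none of the required metadata), and the helper name lm_file_kind is invented for this example; it is not part of any TAN tool.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return an element's name without any '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def lm_file_kind(xml_text):
    """Classify a TAN-A-lm document as source- or language-specific.

    The search covers the whole tree, so the sketch assumes nothing
    about where <source> or <for-lang> sits inside the file.
    """
    root = ET.fromstring(xml_text)
    names = {local_name(el) for el in root.iter()}
    if "source" in names:
        return "source-specific"
    if "for-lang" in names:
        return "language-specific"
    return "unknown"


# Hypothetical, abbreviated fragments; real files carry far more metadata.
SOURCE_SPECIFIC = "<TAN-A-lm><head><source/></head><body/></TAN-A-lm>"
LANGUAGE_SPECIFIC = (
    "<TAN-A-lm><head><for-lang>grc</for-lang></head><body/></TAN-A-lm>"
)

print(lm_file_kind(SOURCE_SPECIFIC))
print(lm_file_kind(LANGUAGE_SPECIFIC))
```

Matching on local names keeps the sketch usable whether or not the input declares a namespace.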
<definitions> takes the elements common to class 2 files (see the section called “Class 2 Metadata (<head>)”). It also takes two elements unique to TAN-A-lm: <lexicon> (optional) and <morphology> (mandatory). Any number of lexica and morphologies may be declared; the order is inconsequential.
There is, at present, no TAN format for lexica and dictionaries, although this may change in the future. So even if a digital form of a dictionary is identified using the pattern described in the section called “Digital Entity Metadata Pattern”, validation tests do not take this element into account.
Because you or other TAN-A-lm editors are likely to be authorities in your own right, a <person> can be treated as if it were a <lexicon>, and be referred to by @lexicon in the <body>.
The <body> of a TAN-A-lm file takes, in addition to the customary optional attributes found in other TAN files (see @in-progress and the section called “Edit Stamp”), @lexicon and @morphology, to specify the default lexicon and grammar.
<body> has only one type of child: one or more <ana>s (short for analysis), each of which matches one or more tokens (<tok>) to one or more lexemes or morphological assertions (<lm>, which takes <l>s and <m>s).
If, due to tokenization, a linguistic token must occupy more than one <tok>, you may use <group> to group <tok>s together.
Elements within an <ana> are distributed. That is, every combination of <l> and <m> (governed by <lm>) is asserted to be true for every <tok>.
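The distribution rule can be sketched in Python as a Cartesian product. The <ana> below is a hypothetical fragment: the val attribute on <tok> and the codes inside <m> are assumptions made for illustration and are not drawn from any real TAN-mor file. The function name expand_ana is likewise invented.

```python
import xml.etree.ElementTree as ET
from itertools import product

# Hypothetical <ana>; attribute names and morphological codes are
# illustrative assumptions only.
ANA = """
<ana>
  <tok val="saw"/>
  <tok val="saws"/>
  <lm>
    <l>saw</l>
    <m>noun</m>
    <m>verb</m>
  </lm>
</ana>
"""


def expand_ana(xml_text):
    """List every (token, lexeme, morphology) claim implied by an <ana>."""
    ana = ET.fromstring(xml_text)
    # iter() also reaches any <tok> nested deeper, e.g. inside a <group>.
    tokens = [tok.get("val") for tok in ana.iter("tok")]
    claims = []
    for lm in ana.iter("lm"):
        lexemes = [lex.text for lex in lm.findall("l")]
        morphs = [m.text for m in lm.findall("m")]
        # Each <l>/<m> pair within this <lm> applies to every token.
        claims.extend(product(tokens, lexemes, morphs))
    return claims


for claim in expand_ana(ANA):
    print(claim)
```

With two tokens, one lexeme, and two morphological values, the fragment yields four separate claims, which is why a compact <ana> can assert more than it appears to at first glance.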
Many TAN-A-lm files will be populated by a stylesheet or other algorithm that automatically lists all possible morphological values of each token. It is advised that such automatically calculated results always include @cert with weighted values.
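One possible way to derive such weights is sketched below in Python. It rests on several assumptions that should be checked against the schema and the project's own practice: that @cert is expressed as a number between 0 and 1, that it is placed on <lm>, and that relative frequency is an acceptable weighting. The candidate counts and the function name weighted_lms are hypothetical.

```python
import xml.etree.ElementTree as ET


def weighted_lms(candidates):
    """Build <lm> elements whose @cert is proportional to each count.

    candidates: list of (lexeme, morphology code, observed count).
    """
    total = sum(count for _, _, count in candidates)
    if total <= 0:
        raise ValueError("at least one candidate needs a positive count")
    elements = []
    for lexeme, morph, count in candidates:
        # Assumption: @cert is a decimal between 0 and 1 placed on <lm>.
        lm = ET.Element("lm", {"cert": f"{count / total:.2f}"})
        ET.SubElement(lm, "l").text = lexeme
        ET.SubElement(lm, "m").text = morph
        elements.append(lm)
    return elements


# Hypothetical counts, e.g. taken from a previously annotated corpus.
candidates = [("saw", "noun", 1), ("see", "verb", 3)]
for lm in weighted_lms(candidates):
    print(ET.tostring(lm, encoding="unicode"))
```

Other weighting schemes are equally plausible; the point is only that machine-generated alternatives should not all be presented with the same confidence.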