TAN-LM files are used to associate words or word fragments with lexemes and morphological categories. They are intended primarily to facilitate research that depends upon alignments, but they can be valuable on their own, whether or not there are other versions or alignments.
These files rely upon the grammatical rules defined for a given language in a TAN-mor file. Therefore this section should be read in close conjunction with its companion, the section called “Morphological Concepts and Patterns (TAN-mor)”.
TAN-LM files are assumed to be applicable to texts in languages whose vocabulary lends itself to grammatical and lexicographical analysis. The two areas are interrelated but independent. If you wish, your TAN-LM file may contain only lexemes or only morphological analyses.
As an editor of a TAN-LM file you should understand the vocabulary and grammar of the languages you have picked. You should have a good sense of the rules established by the lexical and grammatical authorities you have chosen to follow. You should be familiar with the conventions and assumptions of the TAN-mor files you have adopted.
Although you must assume the point of view of a particular grammar and lexicon, you need not define those authorities, nor hold to a single one. In addition, you may bring to lexical analysis your own expertise and supply lexical headwords unattested in printed authorities.
Although TAN-LM files are simple, they can be laborious to write and edit, more so than other types of TAN files. They can also be hard to read if the underlying TAN-mor files use cryptic codes. It is customary for an editor of a TAN-LM file to use tools to help create and edit the data.
The root element of a lexico-morphological file is TAN-LM.
TAN-LM files are either source-specific or language-specific. In the case of the former, <source> points to the one and only TAN-T(EI) file that is the object of analysis. In the case of the latter, <for-lang> is used to indicate the languages that are covered.
Note: If the language-specific option is exercised, the file must point to TAN-LM-lang schema files. See the section called “Overall Structure (root)”.
<declarations> takes the elements common to class 2 files (see the section called “Class 2 Metadata (<head>)”). It also takes two elements unique to TAN-LM: <lexicon> (optional) and <morphology> (mandatory). Any number of lexica and morphologies may be declared; the order is inconsequential.
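The declaration rule above (at least one <morphology>, any number of <lexicon>s, in any order) can be sketched as a small check. This is an illustration only, not part of the TAN validation routines: the fragment is simplified and un-namespaced, and the `id` attribute and its values are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified <declarations>; the id attribute and its values
# are invented for this sketch and do not come from a real TAN-LM file.
DECLARATIONS = """
<declarations>
  <morphology id="mor-a"/>
  <lexicon id="lex-a"/>
  <morphology id="mor-b"/>
</declarations>
"""

def check_declarations(xml_text):
    """Return (lexicon ids, morphology ids); raise if no <morphology> is declared."""
    root = ET.fromstring(xml_text)
    lexica = [e.get("id") for e in root.findall("lexicon")]
    morphologies = [e.get("id") for e in root.findall("morphology")]
    if not morphologies:
        # <lexicon> is optional, <morphology> is mandatory
        raise ValueError("at least one <morphology> must be declared")
    return lexica, morphologies

lexica, morphologies = check_declarations(DECLARATIONS)
print(lexica, morphologies)
```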
There is, at present, no TAN format for lexica and dictionaries, although this may change in the future. So even if a digital form of a dictionary is identified through the section called “Digital Entity Metadata Pattern”, no validation tests will be performed.
You may find a non-TAN lexical model to be a suitable supplement to any TAN collections you develop. The TEI supports dictionary encoding, and the Lexical Markup Framework, an ISO standard (ISO-24613:2008), has defined a data model for lexicons and dictionaries. The former is geared toward philology and the latter toward linguistics. You may also devise your own format if neither of these supports aspects of lexicology that you find important.
Because you or other TAN-LM editors are likely to be authorities in your own right, an <agent> can be treated as if it were a <lexicon>, and be referred to by @lexicon in the <body>.
The <body> of a TAN-LM file takes, in addition to the customary optional attributes found in other TAN files (see @in-progress and the section called “Edit Stamp”), @lexicon and @morphology, to specify the default lexicon and grammar for the file. @lexicon may point either to a <lexicon> id or to an <agent> id (when someone editing the TAN file is an authority).
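As a rough sketch of how a tool might resolve a @lexicon value, the function below looks first among declared <lexicon> ids and then among <agent> ids. The function name, the lookup order, and the sample ids are assumptions for illustration; the guidelines only say that either kind of id is permitted.

```python
def resolve_lexicon(ref, lexicon_ids, agent_ids):
    """Classify a @lexicon value as pointing to a <lexicon> or an <agent>."""
    if ref in lexicon_ids:
        return ("lexicon", ref)
    if ref in agent_ids:
        # an editor of the file acting as a lexical authority
        return ("agent", ref)
    raise ValueError(f"@lexicon value {ref!r} matches no declared id")

# Hypothetical ids, invented for this example
lexicon_ids = {"lex-a"}
agent_ids = {"editor-1"}

print(resolve_lexicon("lex-a", lexicon_ids, agent_ids))
print(resolve_lexicon("editor-1", lexicon_ids, agent_ids))
```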
<body> has only one type of child: one or more <ana>s (short for analysis), each of which matches one or more tokens (<tok>) to one or more lexemes or morphological assertions (<lm>, which takes <l>s and <m>s).
If, due to tokenization, a linguistic token must occupy more than one <tok>, you may use @cont to group <tok>s together.
Elements within an <ana> are distributed, to allow economically sized files. That is, every combination of <l> and <m> (governed by <lm>) is asserted to be true for every <tok>.
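The distribution rule can be made concrete by expanding a compact <ana> into the individual claims it implies. The following is a minimal sketch under stated assumptions: the fragment is un-namespaced, the `ref` and `val` attributes on <tok> are used here only to tell tokens apart, and the values "adverb" and "noun" stand in for whatever codes the adopted TAN-mor file defines.

```python
import itertools
import xml.etree.ElementTree as ET

# Simplified, hypothetical <ana>; attribute names and morphological values
# are placeholders chosen for this sketch.
ANA = """
<ana>
  <tok ref="1" val="down"/>
  <tok ref="7" val="down"/>
  <lm>
    <l>down</l>
    <m>adverb</m>
    <m>noun</m>
  </lm>
</ana>
"""

def expand_ana(xml_text):
    """List every (token, lexeme, morphology) claim implied by one <ana>."""
    ana = ET.fromstring(xml_text)
    toks = [(t.get("ref"), t.get("val")) for t in ana.findall("tok")]
    claims = []
    for lm in ana.findall("lm"):
        # An <lm> may hold only lexemes or only morphological analyses;
        # a None placeholder keeps the product from collapsing to nothing.
        lexemes = [l.text for l in lm.findall("l")] or [None]
        morphs = [m.text for m in lm.findall("m")] or [None]
        claims.extend(itertools.product(toks, lexemes, morphs))
    return claims

claims = expand_ana(ANA)
for claim in claims:
    print(claim)
```

Two tokens, one lexeme, and two morphological values yield four claims, which is why a single <ana> can stand in for several longer entries.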
Many TAN-LM files will be populated by a stylesheet or other algorithm that automatically calculates the possible morphological values of each token, for example, "down" being marked as an adjective, an adverb, a noun, and a verb. In this case, you do not wish to claim that a word really is every combination generated. But you do wish to leave open the possibility for cases where such ambiguity must be expressed (e.g., "down" in "Get down off a duck." being equally a noun and adverb). It is advised that automatically calculated results always include @cert with weighted values that sum to 1 for each token.
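One simple way to obtain such weights is to normalize whatever raw scores the algorithm produces for a token's candidate analyses. The sketch below assumes invented scores for the four readings of "down"; how the scores are derived, and which element carries each @cert value, are left open here.

```python
import math

def cert_weights(scores):
    """Scale raw scores for one token's candidate analyses so they sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive total")
    return {analysis: score / total for analysis, score in scores.items()}

# Hypothetical raw scores (e.g. corpus counts) for the token "down"
raw = {"adjective": 1, "adverb": 5, "noun": 2, "verb": 2}
weights = cert_weights(raw)

for analysis, weight in weights.items():
    print(analysis, weight)
print(math.isclose(sum(weights.values()), 1.0))
```

Each resulting weight would then be written out as the @cert value of the corresponding analysis.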