TAN-mor files are used to describe the grammatical morphological features of a given language, to assign codes to those features, and to define rules governing the application of those codes. The format allows specificity, flexibility, and responsiveness. Assertions in the format may be doubted, rules may be expressed as contingent upon other conditions, and warnings and error messages may be sent to users who have used a pattern incorrectly, or not in accordance with best practices.
The TAN-mor format is a kind of Schematron for the grammar of human languages. You specify the categories and codes for a given language, then you may create tests to define invalid uses of those codes. Those tests are attached to reports and assertions allowing editors of TAN-A-lm files to see not only if the rules have been violated, but why, and exactly where.
This chapter should be read in close conjunction with the section called “Lexico-Morphology”.
Certain assumptions and recommendations are made regarding morphology files, complementing the more general ones; see the section called “Design Principles”.
TAN-mor files are restricted exclusively to describing the categories and rules for the grammar of a natural language. Editors of these files should be well versed in the grammar of the languages they are describing.
The TAN-mor format has been designed with the assumption that patterns of word inflection and formation can be categorized, classified, named, and described. It has also been assumed that scholars may reasonably differ, perhaps radically, on how categories are defined and applied. TAN-mor is meant to allow those differences to be declared. It is up to other users to decide whether or not to adopt them.
The TAN-mor format has also been designed to cater to two different approaches to morphological codes: categorized or uncategorized.
Codes that are categorized are interpreted according not only to the code but also to its position. For example, the categorized codes adopted by Perseus for morphological analysis of Greek, Latin, and other highly inflected languages stipulate ten categories, with the first two being the major and minor parts of speech, and the subsequent categories devoted to person, number, tense, and so forth. Each word that is analyzed must have a value in every category, even if null, and the position of each code is significant.
Uncategorized codes simply give each grammatical feature a unique code, to be applied in any permitted sequence and combination. This approach is viable for any language (including highly inflected ones such as Greek or Latin), but it is most often found in tagging sets for languages that are not highly inflected, e.g., the Brown and Penn sets for English.
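The difference between the two approaches can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the three-category scheme, the code letters, the `-` null marker, and the space-separated flat codes are illustrative assumptions, not the Perseus scheme or any actual TAN-mor file.

```python
# Hypothetical categorized scheme: the position of a code selects the category.
# (Three categories only, for illustration; not the Perseus scheme.)
CATEGORIES = [
    ("part of speech", {"n": "noun", "v": "verb"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"n": "nominative", "a": "accusative"}),
]
NULL = "-"  # assumed marker for "no value in this category"


def read_categorized(code):
    """Interpret each character according to the category at its position."""
    if len(code) != len(CATEGORIES):
        raise ValueError("a categorized code needs one value per category")
    features = []
    for char, (_name, table) in zip(code, CATEGORIES):
        if char != NULL:
            features.append(table[char])
    return features


# Hypothetical uncategorized scheme: every feature has one globally unique code.
FLAT = {"NN": "noun", "SG": "singular", "NOM": "nominative"}


def read_uncategorized(code):
    """Interpret space-separated codes; position carries no meaning."""
    return [FLAT[part] for part in code.split()]


# "n" means noun in the first position but nominative in the third.
print(read_categorized("nsn"))
print(read_categorized("v--"))
print(read_uncategorized("NN SG NOM"))
```

In the categorized scheme the same letter can be reused across categories because position disambiguates it. In the uncategorized scheme every code must be distinct on its own.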
The root element of a morphological rule file is <TAN-mor>.
Zero or more <source> elements describe the grammars or related works that account for the rules declared in the TAN file. If the rules are not based upon any published work, then <source> may be omitted. Any TAN-mor file without a source will be assumed to be based upon the personal knowledge of the <person>s who edited the file.
<definitions> is populated with the grammatical <feature>s that are considered operative. If a particular discipline customarily uses codes that are not allowed in @xml:id, you may wish to create an <alias>.
The <body> of a TAN-mor file takes the customary optional attributes found in other TAN files (see @in-progress and the section called “Edit Stamp”).
The children of <body> begin with one or more <for-lang>s, followed by any number of <where>s (containing <assert>s or <report>s) or <category>s (if relying upon structured codes).
<category>, used for structured codes, sorts <feature>s into groups, assigning them @code values that are unique within the <category>.
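The uniqueness constraint can be sketched as a small Python check. The data layout (a dictionary from category name to feature/code pairs) and the function name `duplicate_codes` are hypothetical conveniences, not part of the TAN-mor format or its validation routines.

```python
from collections import Counter


def duplicate_codes(categories):
    """Return, per category, the codes assigned to more than one feature.

    `categories` is a hypothetical stand-in for the <category> elements:
    a dict mapping a category name to a list of (feature, code) pairs.
    """
    problems = {}
    for name, pairs in categories.items():
        counts = Counter(code for _feature, code in pairs)
        repeated = sorted(code for code, n in counts.items() if n > 1)
        if repeated:
            problems[name] = repeated
    return problems


# The code "n" appears in two different categories, which is acceptable.
ok = {
    "part of speech": [("noun", "n"), ("verb", "v")],
    "case": [("nominative", "n"), ("accusative", "a")],
}
# The code "a" appears twice inside one category, which is not.
bad = {
    "case": [("nominative", "n"), ("accusative", "a"), ("ablative", "a")],
}

print(duplicate_codes(ok))
print(duplicate_codes(bad))
```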
<assert>s and <report>s are used to declare rules that must be followed, or must never be followed, by any dependent TAN-A-lm file.
An <assert> or <report> will be checked only if the conditions in the enclosing <where> are met in the context of a given <m> in a dependent TAN-A-lm file:
@m-matches: <m> matches the pattern (regular expression).
@tok-matches: one of the values of <tok> in the given <ana> matches the pattern (regular expression).
@m-has-features: <m> has the specified features.
@m-has-how-many-features: <m> has the given number of features.
An <assert> also has one or more of the truth conditions above. If the test proves false for a given <m>, then that <m> will be marked as erroneous and the message included in the <assert> should be returned.
<report> has the same effect, but the role of the test is the opposite: the error and message will be returned only if the test proves true.
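The interplay of <where>, <assert>, and <report> can be sketched in Python. This is an illustration of the logic described above, not the TAN validation code. It rests on several assumptions: the class and function names are hypothetical, an <m> is treated as a space-separated list of feature codes, multiple conditions are combined with "and", and patterns are tested with Python's `re.search`, whose regular expression dialect may differ from the one TAN uses.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Test:
    """Conditions named after the attributes above; None means 'not set'."""
    m_matches: Optional[str] = None
    tok_matches: Optional[str] = None
    m_has_features: Optional[Tuple[str, ...]] = None
    m_has_how_many_features: Optional[int] = None

    def holds(self, m: str, toks: List[str]) -> bool:
        features = m.split()  # assumption: <m> holds space-separated codes
        checks = []
        if self.m_matches is not None:
            checks.append(re.search(self.m_matches, m) is not None)
        if self.tok_matches is not None:
            checks.append(any(re.search(self.tok_matches, t) for t in toks))
        if self.m_has_features is not None:
            checks.append(all(f in features for f in self.m_has_features))
        if self.m_has_how_many_features is not None:
            checks.append(len(features) == self.m_has_how_many_features)
        return all(checks)  # assumption: every condition that is set must hold


@dataclass
class Rule:
    kind: str  # "assert" or "report"
    test: Test
    message: str


def check(where: Test, rules: List[Rule], m: str, toks: List[str]) -> List[str]:
    """Return the messages triggered for one <m>, given its sibling <tok>s."""
    if not where.holds(m, toks):
        return []  # the enclosing <where> does not apply to this <m>
    messages = []
    for rule in rules:
        result = rule.test.holds(m, toks)
        # An assert complains when its test is false, a report when it is true.
        if (rule.kind == "assert" and not result) or (
            rule.kind == "report" and result
        ):
            messages.append(rule.message)
    return messages


where = Test(m_has_features=("verb",))
rules = [
    Rule("assert", Test(m_has_how_many_features=3), "A verb needs three features."),
    Rule("report", Test(m_has_features=("nominative",)), "A verb cannot take a case."),
]

print(check(where, rules, "verb singular", ["runs"]))
print(check(where, rules, "verb singular nominative", ["runs"]))
print(check(where, rules, "noun singular", ["dog"]))
```

The third call returns no messages because the <where> condition is not met, so neither rule is checked at all.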