TAN-mor files are used to describe the grammatical morphological features of a given language, to assign codes to those features, and to define rules governing the application of those codes. The format allows specificity, flexibility, and responsiveness. Assertions in the format may be doubted, rules may be expressed as contingent upon other conditions, and warnings and error messages may be sent to users who have used a pattern incorrectly, or not in accordance with best practices.
The TAN-mor format is like Schematron for the grammar of human languages. You specify the categories and codes for a given language, then you may create tests to define invalid uses of those codes. Those tests are attached to reports and assertions allowing editors of TAN-LM files to see not only if the rules have been violated, but why, and exactly where.
This chapter should be read in close conjunction with the section called “Lexico-Morphology”.
Certain assumptions and recommendations are made regarding morphology files, complementing the more general ones; see the section called “Design Principles”.
TAN-mor files are restricted exclusively to describing the categories and rules for the grammar of a natural language. Editors of these files should be well versed in the grammar of the languages they are describing.
The TAN-mor format has been designed with the assumption that patterns of word inflection and formation can be categorized, classified, named, and described. It has also been assumed that scholars may reasonably differ, perhaps radically, on those descriptions. TAN-mor is meant to allow those differences to be declared. It is up to other users to decide whether or not to adopt them.
The TAN-mor format has also been designed to cater to two different approaches to morphological codes: structured or unstructured.
Structured codes begin with a set of major categories used to group morphological features. Structured codes tend to have a set number of code elements, and usually require gaps in the code. For example, the Perseus approach to the morphological categories of Greek, Latin, and other highly inflected languages dictates ten categories, with the first two being the major and minor parts of speech, and the subsequent categories devoted to person, number, tense, and so forth. Each word that is analyzed must have a value for every category, even if that value is null.
Unstructured codes do not attempt to categorize grammatical features, but simply give each one a unique code, to be applied in any permitted sequence and combination. This approach is viable for any language (including highly inflected ones such as Greek or Latin), but it is most often found in tagging sets for languages that have little inflection, e.g., the Brown and Penn sets for English.
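The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration: the category order, the gap character `-`, and the sample codes are hypothetical and are not taken from Perseus, Brown, Penn, or any TAN-mor file.

```python
# Hypothetical category order for a positional (structured) scheme.
CATEGORIES = ["part_of_speech", "person", "number", "tense"]
GAP = "-"  # assumed placeholder for a category that does not apply


def parse_structured(code):
    """Map each position of a fixed-length code to its category; gaps become None."""
    if len(code) != len(CATEGORIES):
        raise ValueError(f"expected {len(CATEGORIES)} positions, got {len(code)}")
    return {cat: (None if ch == GAP else ch) for cat, ch in zip(CATEGORIES, code)}


def parse_unstructured(code):
    """Split a whitespace-separated code into feature codes, in the order given."""
    return code.split()


structured = parse_structured("n-s-")      # noun, no person, singular, no tense
unstructured = parse_unstructured("NN PL")  # two free-standing feature codes
```

In the structured case the position of a character carries meaning, so every position must be filled. In the unstructured case each code stands on its own and nothing needs padding.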
The root element of a morphological rule file is <TAN-mor>.
Zero or more <source> elements describe the grammars or related works that account for the rules declared in the TAN file. If the rules are not based upon any published work, then <source> may be omitted. Any TAN-mor file without a source is assumed to be based upon the personal knowledge of the <agent>s who edited the file.
<declarations> is empty.
The <body> of a TAN-mor file takes the customary optional attributes found in other TAN files (see @in-progress and the section called “Edit Stamp”).
The children of <body> begin with one or more <for-lang>s, followed by any number of <assert>s, <report>s, <feature>s (for unstructured codes), or <category>s (if relying upon structured codes).
<category>, used for structured codes, sorts <feature>s into groups. @code values must be unique within a <category>, but may duplicate the @code values of <feature>s from other <category>s. The first <feature> in a <category> describes the category itself, and is not a <feature> like the others.
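The uniqueness rule for @code can be pictured with a small Python check. This is a sketch over hypothetical data, not a TAN validator: the category names and codes are invented, and each list is assumed to hold only the ordinary features of a category, not the first one that describes the category itself.

```python
from collections import Counter

# Hypothetical data: category name -> feature codes declared inside it.
categories = {
    "number": ["s", "p", "d"],
    "person": ["1", "2", "3"],
    "gender": ["m", "f", "m"],  # "m" is repeated within one category: invalid
    "mood": ["s", "i"],         # "s" also occurs under "number": allowed
}


def duplicate_codes(categories):
    """Return {category: sorted codes that occur more than once in that category}."""
    problems = {}
    for name, codes in categories.items():
        repeated = sorted(code for code, n in Counter(codes).items() if n > 1)
        if repeated:
            problems[name] = repeated
    return problems


problems = duplicate_codes(categories)
```

Only the repetition inside "gender" is flagged. The shared code "s" in "number" and "mood" is not, because a code's category resolves the ambiguity.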
The values and combinations of <feature>s (or rather of the @codes of <feature>s) can be constrained through <assert>s and <report>s, which are used to declare rules that must be followed, or must never be followed, by any dependent TAN-LM file.
An <assert> or <report> may be restricted to specific features through @context. If @context is present, then <assert> and <report> declarations will be checked in a TAN-LM file only against values of <m> that invoke the feature; otherwise, all <m>s will be tested. Four kinds of tests are allowed:
@matches-m: indicates a regular expression pattern to be checked against the code in an <m>.
@matches-tok: indicates a regular expression pattern to be checked against the tokens picked by the values of <tok> in a dependent TAN-LM file.
@feature-test: indicates features to be checked in the content of <m>s.
@feature-qty-test: indicates the number of features to be checked in the content of <m>s.
An <assert> indicates that for any <m> in any dependent TAN-LM file, if the test proves false, and if the <m> has a feature declared in @context, then the <m> should be marked as erroneous (or merely a warning should be returned, if @cert is present) and the message included by the <assert> should be returned.
<report> has the same effect, but the role of the test is the opposite: the error and message will be returned only if the test proves true.
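The opposite polarity of <assert> and <report> can be sketched as follows. This is a rough model, not how TAN validation is implemented. It makes several assumptions: a rule is a plain dictionary, the content of an <m> is a whitespace-separated string of feature codes, only a @matches-m style test is modeled, and Python's `re.search` stands in for whatever regular expression flavor a TAN processor actually uses.

```python
import re


def evaluate(rule, m_code):
    """Return (severity, message) if the rule fires for this <m> code, else None.

    rule keys (all hypothetical): 'kind' is 'assert' or 'report'; 'pattern' is a
    regex standing in for @matches-m; 'message' is the text to return; 'context'
    and 'cert' are optional.
    """
    features = m_code.split()
    context = rule.get("context")
    if context is not None and context not in features:
        return None  # this <m> does not invoke the contextual feature
    test_is_true = re.search(rule["pattern"], m_code) is not None
    # An assert fires when the test is false; a report fires when it is true.
    fires = (not test_is_true) if rule["kind"] == "assert" else test_is_true
    if not fires:
        return None
    severity = "warning" if "cert" in rule else "error"
    return (severity, rule["message"])


needs_number = {
    "kind": "assert",
    "context": "verb",
    "pattern": r"\b(sg|pl)\b",
    "message": "A verb should be marked for number.",
}
both_numbers = {
    "kind": "report",
    "pattern": r"\bsg\b.*\bpl\b",
    "message": "Singular and plural are rarely combined.",
    "cert": 0.5,
}

result_a = evaluate(needs_number, "verb pres")   # assert fails: error
result_b = evaluate(needs_number, "noun")        # outside context: not tested
result_c = evaluate(both_numbers, "noun sg pl")  # report succeeds: warning
```

The same pattern therefore behaves differently depending on the element that carries it: in an <assert> it states what must be found, in a <report> what must not be.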