TAN-mor files are used to delineate the morphological characteristics or features of a given language, to assign codes to those features, and to define rules governing the application of those codes. The format is a kind of schema language for the grammar of human languages.
The format allows specificity, flexibility, and responsiveness. Grammatical rules may be constructed to return warnings and error messages to users who use a code or pattern incorrectly, or not in accordance with best practices. Such rules may be qualified, or made contingent upon certain conditions.
This chapter should be read in close conjunction with the section called “Lexico-morphology (<TAN-A-lm>)”.
Certain assumptions and recommendations are made regarding morphology files, complementing the more general ones; see the section called “Design principles”.
TAN-mor files are restricted exclusively to describing the categories and rules for the grammar of a natural language. Editors of these files should be well versed in the grammar of the languages they are describing, and generally acquainted with how the grammars of comparable languages work.
The TAN-mor format has been designed under the assumption that patterns of word inflection and formation can be categorized, classified, named, and described. It has also been assumed that scholars may reasonably differ, perhaps radically, on how grammatical features should be defined and applied. TAN-mor allows scholars to declare clearly their operative assumptions and views. It is up to other users to decide whether or not to adopt them.
The TAN-mor format has also been designed to cater to two different approaches to morphological codes: categorized or uncategorized.
Categorized codes are interpreted according to position: a b c would mean something different from c b a. For example, Perseus (http://www.perseus.tufts.edu/hopper/) has traditionally categorized codes for morphological analysis of Greek, Latin, and other highly inflected languages. Every code has ten positions, each one corresponding to a major grammatical category, with the first two being the major and minor parts of speech, and the subsequent categories devoted to person, number, tense, and so forth. Each word that is analyzed must have a value in every position, even if only a hyphen or null. A d in one position means something different from a d in another.
Uncategorized codes, on the other hand, assign one unique code to each grammatical feature. In this approach, codes may be combined and arranged at will: a b c would be identical to c b a. This approach is viable for any language (including highly inflected ones such as Greek or Latin), but in practice it is most often applied to languages that are not highly inflected, e.g., the Brown and Penn tag sets for English.
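The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a code string is taken to be a space-separated sequence of codes, and the function names are hypothetical, not part of TAN.

```python
def categorized_key(code_string):
    """Position matters: keep the codes as an ordered tuple."""
    return tuple(code_string.lower().split())


def uncategorized_key(code_string):
    """Position is irrelevant: keep the codes as an unordered set."""
    return frozenset(code_string.lower().split())


# Categorized: 'a b c' and 'c b a' are different analyses.
print(categorized_key("a b c") == categorized_key("c b a"))      # False
# Uncategorized: the same two strings are equivalent.
print(uncategorized_key("a b c") == uncategorized_key("c b a"))  # True
```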
TAN-mor morphological codes may not include either the space or the hyphen, and unlike IDrefs, they are case insensitive. For example, the codes NOUN and noun are interchangeable.
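A minimal sketch of these two constraints follows. The helper name is hypothetical, and lowercasing is assumed here as one simple way to compare codes without regard to case; a real TAN validator may fold case differently.

```python
def normalize_code(code):
    """Return a case-folded code, rejecting any space or hyphen."""
    if " " in code or "-" in code:
        raise ValueError(f"code may not contain a space or hyphen: {code!r}")
    return code.lower()


print(normalize_code("NOUN") == normalize_code("noun"))  # True
```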
The root element of a morphological rule file is <TAN-mor>.
Zero or more <source>s refer to the grammars or related works that account for the morphological rules. If the categories, codes, and rules are not based upon any published work, then <source> may be omitted. Any TAN-mor file without a source may be inferred to be based upon the personal knowledge of the persons or organizations identified in <file-resp>.
A language declaration is made in the header: one or more <for-lang>s.
The <body> of a TAN-mor file takes the customary optional attributes found in other TAN files (see the section called “Edit stamp”).
<body> contains interleaved rules and grammatical codes, either categorized or not.
Grammatical rules consist of a series of <rule>s, perhaps filtered by attribute tests, and perhaps filtered by children <where>s with attribute tests. These tests are evaluated against the context of the various <m>s in a dependent TAN-A-lm file.
Attribute tests are as follows:
@m-matches (regular expression): <m> matches the pattern.
@tok-matches (regular expression): one of the values of <tok> in the given <ana> matches the pattern.
@m-has-codes (space-delimited strings): <m> has the specified feature codes.
@m-has-how-many-codes (integer): <m> has the given number of feature codes.
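The four tests can be sketched as a single Python function. This is an illustration, not TAN's validation code, and it rests on several assumptions: an <m> is modelled as a string of space-separated codes, @m-has-codes is taken to require all of the listed codes, @m-has-how-many-codes is treated as a single integer (ranges such as the 2-10 in Example 7.1 are not handled), and Python's re module stands in for whatever regular-expression flavour TAN uses. The function name is hypothetical.

```python
import re


def attribute_tests_pass(m_text, tok_values, *, m_matches=None,
                         tok_matches=None, m_has_codes=None,
                         m_has_how_many_codes=None):
    """Return True if every supplied attribute test holds for one <m>."""
    codes = m_text.lower().split()
    if m_matches is not None and not re.search(m_matches, m_text):
        return False
    if tok_matches is not None and not any(
            re.search(tok_matches, tok) for tok in tok_values):
        return False
    if m_has_codes is not None and not set(
            m_has_codes.lower().split()) <= set(codes):
        return False
    if (m_has_how_many_codes is not None
            and len(codes) != m_has_how_many_codes):
        return False
    return True


# An <m> with two codes, in an <ana> whose only <tok> value is "and".
print(attribute_tests_pass("c x", ["and"], m_matches="^c",
                           m_has_how_many_codes=2))              # True
print(attribute_tests_pass("c x", ["and"], tok_matches="^or$"))  # False
```

Tests that are not supplied are simply skipped, which mirrors the way a <rule> or <where> carries only the attributes it needs.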
If all the attributes in a <rule> or any of its children <where>s evaluate true against a context, then the process allows the actual rules to be evaluated. Those rules are found in the enclosed <assert>s or <report>s, which declare rules that must be followed, or must never be followed, by any dependent TAN-A-lm file.
An <assert> or <report> will be checked only if the conditions declared by the attributes in the enclosing <rule> or <where> are met:
An <assert> also has one or more of the truth conditions above. If the test proves false in a given <m>, then the <m> will be marked as erroneous and the message included by the <assert> should be returned.
<report> has the same effect, but the test looks for the opposite boolean value: the error and message will be returned only if the test proves true.
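The opposite polarity of the two elements can be shown with a small Python sketch. It assumes, for simplicity, that the only truth condition is @m-matches and that an <m> is a plain string; the function names are hypothetical, and the patterns are modelled on those in Example 7.1.

```python
import re


def check_assert(m_text, pattern, message):
    """<assert>: return the message when the test proves false."""
    return None if re.search(pattern, m_text) else message


def check_report(m_text, pattern, message):
    """<report>: return the message when the test proves true."""
    return message if re.search(pattern, m_text) else None


# An interrogative must be a determiner (d) or a pronoun (p).
print(check_assert("n i", "^[dp]", "must be d or p"))    # must be d or p
print(check_assert("p i", "^[dp]", "must be d or p"))    # None
# A conjunction should carry no further codes.
print(check_report("c x", "^c", "no other properties"))  # no other properties
```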
Mixed with the rules are codes, either categorized or not.
If categorized, there are zero or more <category>s. Each one sorts <code>s into groups, assigning them <val>s that are unique within the <category>. Sequence is important. The first <category> defines the features allowed in the first code position, the second in the second, and so forth.
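To make the role of sequence concrete, the following Python sketch decodes a categorized code by pairing each value with the category at the same position. The three categories and their values are invented for illustration, as is the assumption that the values in a code are separated by spaces.

```python
# Hypothetical categories, in document order; each maps a <val> to a feature.
CATEGORIES = [
    ("part of speech", {"n": "noun", "v": "verb"}),
    ("person", {"1": "first", "2": "second", "3": "third"}),
    ("number", {"s": "singular", "p": "plural"}),
]


def decode(code_string):
    """Interpret each space-separated value by the category at its position."""
    values = code_string.lower().split()
    if len(values) > len(CATEGORIES):
        raise ValueError("more values than categories")
    result = {}
    for (name, vals), value in zip(CATEGORIES, values):
        if value not in vals:
            raise ValueError(f"{value!r} is not allowed in category {name!r}")
        result[name] = vals[value]
    return result


print(decode("v 3 p"))
# {'part of speech': 'verb', 'person': 'third', 'number': 'plural'}
```

Because each value is looked up only within its own category, the same value may be reused across categories without ambiguity, but it must be unique within a category.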
If not categorized, then there are simply one or more <code>s. Each <code> has a @feature that points to one or more vocabulary items for a grammatical feature, either by IDref or by name.
TAN has a standard vocabulary file for grammatical features: vocabularies/features.TAN-voc.xml. This vocabulary file encodes 746 grammatical features declared in the OLiA Reference Model for Morphology, Morphosyntax and Syntax (http://purl.org/olia/olia.owl). See the section called “TAN keywords for features (<feature>)”.
<code> must have a <val>, which contains the actual code used, and it may take one or more <desc>s to explain how the grammatical features should be interpreted for a given language. This is the ideal place to provide examples.
In addition to the examples below, see the sample TAN-mor files in the examples directory.
Example 7.1. Examples of rules and codes
<rule m-has-how-many-codes="2-10">
   <report m-matches="^c">A conjunction has no other inflectional properties.</report>
   <report m-matches="^r">A preposition has no other inflectional properties.</report>
   <report m-matches="^i">An interjection has no other inflectional properties.</report>
   <report m-matches="^y">An acronym has no other inflectional properties.</report>
</rule>
. . . . . . .
<rule m-matches="^. i">
   <assert m-matches="^[dp]">An interrogative must be either a determiner (d) or a pronoun (p).</assert>
</rule>
. . . . . .
<code feature="accusative"><val>accusative</val></code>
<code feature="nominative"><val>nominative</val></code>
<code feature="case_dative"><val>dative</val></code>
<code feature="case_genitive"><val>genitive</val></code>
<code feature="case_vocative"><val>vocative</val></code>
. . . . . .
<category feature="feature_person">
   <code feature="first"><val>1</val></code>
   <code feature="second"><val>2</val></code>
   <code feature="third"><val>3</val></code>
</category>