TAN-mor files are used to delineate the morphological characteristics or features of a given language, to assign codes to those features, and to define rules governing the application of those codes. The format is a kind of Schematron for the grammar of human languages.
The format allows specificity, flexibility, and responsiveness. Grammatical rules may be constructed to return warnings and error messages to users who use a code or pattern incorrectly, or not in accordance with best practices. Such rules may be qualified, or made contingent upon certain conditions.
This chapter should be read in close conjunction with the section called “Lexico-Morphology (<TAN-A-lm>)”.
Certain assumptions and recommendations are made regarding morphology files, complementing the more general ones; see the section called “Design Principles”.
TAN-mor files are restricted exclusively to describing the categories and rules for the grammar of a natural language. Editors of these files should be well versed in the grammar of the languages they are describing, and generally acquainted with how the grammars of comparable languages work.
The TAN-mor format has been designed with the assumption that patterns of word inflection and formation can be categorized, classified, named, and described. It has also been assumed that scholars may reasonably differ, perhaps radically, on how categories should be defined and applied. TAN-mor allows scholars to declare clearly their operative assumptions and views. It is up to other users to decide whether or not to adopt them.
The TAN-mor format has also been designed to cater to two different approaches to morphological codes: categorized or uncategorized.
Categorized codes are interpreted according to position. “a b c” would mean something different from “c b a”. For example, Perseus (http://www.perseus.tufts.edu/hopper/) adopts categorized codes for morphological analysis of Greek, Latin, and other highly inflected languages. Every code has ten positions, each one corresponding to a major grammatical category, with the first two being the major and minor parts of speech, and the subsequent categories devoted to person, number, tense, and so forth. Each word that is analyzed must have a value, even if a hyphen or null. A “d” in one position means something different from a “d” in another.
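The positional reading described above can be sketched in Python. The three-position scheme and its letter values below are invented for illustration; they are not the Perseus tag set, and the hyphen as an empty-position marker is assumed only for this sketch.

```python
# Hypothetical three-position scheme; NOT the Perseus tag set.
# Each position has its own table, so one letter can mean different things.
POSITIONS = [
    ("part of speech", {"n": "noun", "v": "verb", "d": "adverb"}),
    ("number", {"s": "singular", "p": "plural", "d": "dual"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative"}),
]


def read_categorized(code):
    """Interpret each character of `code` by the position it occupies."""
    if len(code) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(code)}")
    result = {}
    for char, (category, table) in zip(code, POSITIONS):
        if char == "-":  # assumed marker for a position with no value
            continue
        if char not in table:
            raise ValueError(f"{char!r} is not valid for {category}")
        result[category] = table[char]
    return result


print(read_categorized("ndd"))
print(read_categorized("d--"))
```

In this sketch the letter “d” resolves to adverb, dual, or dative depending solely on where it stands in the code.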
Uncategorized codes, on the other hand, assign one unique code to each grammatical feature. In this approach, codes may be combined and arranged at will. “a b c” would be identical to “c b a”. This approach is viable for any language (including highly inflected ones such as Greek or Latin), but it is in practice most often found serving languages that are not highly inflected, e.g., the Brown and Penn sets for English.
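Order independence can be modeled by treating a group of uncategorized codes as a set. The following Python sketch uses an invented code table; the codes and feature names are assumptions for illustration only.

```python
# Hypothetical uncategorized codes: one unique code per feature.
FEATURES = {"a": "noun", "b": "singular", "c": "nominative"}


def read_uncategorized(codes):
    """Return the set of features named by a space-separated list of codes."""
    return {FEATURES[code] for code in codes.split()}


# The same features result regardless of the order of the codes.
print(read_uncategorized("a b c") == read_uncategorized("c b a"))
```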
TAN-mor morphological codes may not include either the space or the hyphen, and unlike IDrefs, they are case insensitive. The codes “NOUN” and “noun” are interchangeable.
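A minimal Python sketch of these two constraints follows. The function name is hypothetical, and lowercasing is only one plausible way to make comparison case insensitive; how a TAN processor does this internally is not stated here.

```python
def normalize_code(code):
    """Reject spaces and hyphens, then fold case so NOUN and noun compare equal."""
    if " " in code or "-" in code:
        raise ValueError(f"code {code!r} may not contain a space or hyphen")
    return code.lower()


print(normalize_code("NOUN") == normalize_code("noun"))
```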
The root element of a morphological rule file is <TAN-mor>.
Zero or more <source>s describe the grammars or related works that account for the morphological rules. If the categories, codes, and rules are not based upon any published work, then <source> may be omitted. Any TAN-mor file without a source may be inferred to be based upon the personal knowledge of the persons or organizations identified in <file-resp>.
<vocabulary-key> is populated with the grammatical <feature>s that are allowed grammatical concepts in the language, and they are assigned codes via @xml:id. Because a grammatical feature is not allowed in a TAN-mor file until it is explicitly declared in a <feature>, @xml:id might simply repeat the value of @which.
TAN has a standard vocabulary file for grammatical features: vocabularies/features.TAN-voc.xml. This vocabulary file encodes 746 vocabulary items corresponding to core grammatical features declared in the OLiA Reference Model for Morphology, Morphosyntax and Syntax (http://purl.org/olia/olia.owl). See the section called “TAN keywords for features (<feature>)”.
If you wish to incorporate into your codes characters that are not allowed in @xml:id, e.g., “$” or “:”, you should create an <alias>, whose @id allows such values. An <alias> can of course also be used to assign multiple grammatical features to a single id.
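The relationship between feature ids and aliases can be pictured as a lookup, sketched here in Python. The ids and the alias-to-feature mappings are invented for illustration and do not come from any actual TAN-mor file.

```python
# Hypothetical data: feature ids declared via @xml:id, and aliases whose
# ids may use characters such as "$" or ":" that @xml:id does not allow.
FEATURE_IDS = {"pron", "poss", "punct"}
ALIASES = {
    "prp$": ["pron", "poss"],  # one alias id standing for two features
    ":": ["punct"],
}


def resolve(code):
    """Return the feature ids a code stands for, following an alias if any."""
    key = code.lower()  # codes are case insensitive
    if key in ALIASES:
        return list(ALIASES[key])
    if key in FEATURE_IDS:
        return [key]
    raise KeyError(f"undeclared code: {code!r}")


print(resolve("PRP$"))
print(resolve("pron"))
```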
The <body> of a TAN-mor file takes the customary optional attributes found in other TAN files (see the section called “Edit Stamp”).
Within <body>, you begin with a language declaration: one or more <for-lang>s.
After the language declaration come rules: zero or more <where>s declaring rules to be followed for the feature codes. <where> has attributes that establish the context under which its enclosed rules are operative. Those rules are found in the enclosed <assert>s or <report>s, which declare rules that must be followed, or must never be followed, by any dependent TAN-A-lm file.
An <assert> or <report> will be checked only if the conditions declared by the attributes in the enclosing <where> are met in the context of a given <m>:
@m-matches (regular expression): <m> matches the pattern.
@tok-matches (regular expression): one of the values of <tok> in the given <ana> matches the pattern.
@m-has-features (space-delimited strings): <m> has the specified features.
@m-has-how-many-features (integer): <m> has the given number of features.
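The four conditions can be sketched as a single predicate in Python. Several points here are assumptions rather than statements about TAN itself: that multiple attributes combine with AND, that a pattern counts as a match when found anywhere in the string, and that Python's re module stands in for the regular expression dialect TAN actually uses. The function and parameter names are hypothetical.

```python
import re


def where_applies(where, m_value, m_features, tok_values):
    """Return True if every condition present in `where` holds for one <m>.

    `where` is a dict keyed by attribute name; absent attributes impose
    no condition. Assumed: conditions combine with AND, and a pattern
    matches if it is found anywhere in the string (re.search).
    """
    if "m-matches" in where and not re.search(where["m-matches"], m_value):
        return False
    if "tok-matches" in where and not any(
        re.search(where["tok-matches"], tok) for tok in tok_values
    ):
        return False
    if "m-has-features" in where:
        wanted = set(where["m-has-features"].lower().split())
        if not wanted <= {f.lower() for f in m_features}:
            return False
    if "m-has-how-many-features" in where:
        if len(m_features) != int(where["m-has-how-many-features"]):
            return False
    return True


rule_context = {"m-has-features": "noun", "tok-matches": "s$"}
print(where_applies(rule_context, "noun pl", ["noun", "pl"], ["dogs"]))
print(where_applies(rule_context, "noun pl", ["noun", "pl"], ["dog"]))
```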
An <assert> also has one or more of the truth conditions above. If the test proves false in a given <m> then the <m> will be marked as erroneous and the message included by the <assert> should be returned.
<report> has the same effect, but the test looks for the opposite boolean value: the error and message will be returned only if the test proves true.
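The opposite polarity of the two elements can be stated compactly in Python. This sketch assumes the test has already been evaluated to a boolean; the function name and the sample message are hypothetical.

```python
def check_rule(kind, test_result, message):
    """Return the rule's message if it flags an error, else None.

    kind: "assert" flags an error when the test is false;
          "report" flags an error when the test is true.
    """
    if kind not in ("assert", "report"):
        raise ValueError(f"unknown rule kind: {kind!r}")
    is_error = (not test_result) if kind == "assert" else test_result
    return message if is_error else None


print(check_rule("assert", False, "a noun must have a case"))
print(check_rule("report", False, "a noun must not have a tense"))
```

The first call returns the message, because an assertion failed; the second returns None, because the reported condition did not occur.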
After the rules comes a structure declaration (if relying upon structured, i.e., categorized, codes): zero or more <category>s. Each one sorts <feature>s into groups, assigning them @code values that are unique within the <category>. Sequence is important. The first <category> defines the features allowed in the first code position, the second in the second, and so forth.
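The two constraints on categories, ordered sequence and uniqueness of codes within each one, can be sketched in Python. The input shape (a list of lists of code and feature pairs) and all names are assumptions made for this illustration.

```python
def build_categories(category_specs):
    """Build one lookup table per category, in document order.

    category_specs: list of categories, each a list of (code, feature) pairs.
    A code may recur across categories but not within one.
    """
    tables = []
    for index, pairs in enumerate(category_specs, start=1):
        table = {}
        for code, feature in pairs:
            key = code.lower()  # codes are case insensitive
            if key in table:
                raise ValueError(f"code {code!r} repeated in category {index}")
            table[key] = feature
        tables.append(table)
    return tables


tables = build_categories([
    [("n", "noun"), ("v", "verb")],        # first code position
    [("s", "singular"), ("n", "neuter")],  # second code position
])
print(tables[0]["n"], tables[1]["n"])
```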
See sample TAN-mor files in the examples directory.