Morphological Concepts and Patterns (TAN-mor)

TAN-mor files delineate the morphological characteristics or features of a given language, assign codes to those features, and define rules governing the application of those codes. The format is a kind of Schematron for the grammar of human languages.

The format allows specificity, flexibility, and responsiveness. Grammatical rules may be constructed to return warnings and error messages to users who use a code or pattern incorrectly, or not in accordance with best practices. Such rules may be qualified, or made contingent upon certain conditions.

This chapter should be read in close conjunction with the section called “Lexico-Morphology (<TAN-A-lm>)”.

Principles and Assumptions

Certain assumptions and recommendations are made regarding morphology files, complementing the more general ones; see the section called “Design Principles”.

TAN-mor files are restricted to describing the categories and rules for the grammar of a natural language. Editors of these files should be well versed in the grammar of the languages they are describing, and generally acquainted with how the grammars of comparable languages work.

The TAN-mor format has been designed with the assumption that patterns of word inflection and formation can be categorized, classified, named, and described. It has also been assumed that scholars may reasonably differ, perhaps radically, on how categories should be defined and applied. TAN-mor allows scholars to declare clearly their operative assumptions and views. It is up to other users to decide whether or not to adopt them.

The TAN-mor format has also been designed to cater to two different approaches to morphological codes: categorized or uncategorized.

Categorized codes are interpreted according to position. The sequence a b c would mean something different from c b a. For example, Perseus (http://www.perseus.tufts.edu/hopper/) adopts categorized codes for morphological analysis of Greek, Latin, and other highly inflected languages. Every code has ten positions, each one corresponding to a major grammatical category, with the first two being the major and minor parts of speech, and the subsequent categories devoted to person, number, tense, and so forth. Each word that is analyzed must have a value in every position, even if only a hyphen or null. A d in one position means something different from a d in another.

Uncategorized codes, on the other hand, assign one unique code to each grammatical feature. In this approach, codes may be combined and arranged at will. a b c would be identical to c b a. This approach is viable for any language (including highly inflected ones such as Greek or Latin), but it is in practice most often found serving languages that are not highly inflected, e.g., the Brown and Penn sets for English.
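The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration, not part of TAN: the function names and sample codes are hypothetical, and the codes are assumed to be space-delimited.

```python
def categorized_key(code: str) -> tuple:
    """Position matters: keep the codes in their original sequence."""
    return tuple(code.split())


def uncategorized_key(code: str) -> frozenset:
    """Position is irrelevant: treat the codes as an unordered set."""
    return frozenset(code.split())


# Categorized: "a b c" and "c b a" are different analyses.
print(categorized_key("a b c") == categorized_key("c b a"))      # False
# Uncategorized: "a b c" and "c b a" are the same analysis.
print(uncategorized_key("a b c") == uncategorized_key("c b a"))  # True
```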

TAN-mor morphological codes may contain neither spaces nor hyphens and, unlike IDrefs, they are case insensitive: the codes NOUN and noun are interchangeable.
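A processor that compares codes might normalize them first. The following Python sketch assumes that lowercasing is an adequate way to make codes case insensitive; `normalize_code` is a hypothetical helper, not a TAN function.

```python
def normalize_code(code: str) -> str:
    """Return a comparable form of a single morphological code.

    Rejects codes that are empty or that contain a space or hyphen,
    then lowercases the code so that NOUN and noun compare as equal.
    """
    if not code:
        raise ValueError("a code may not be empty")
    if " " in code or "-" in code:
        raise ValueError(f"a code may not contain a space or hyphen: {code!r}")
    return code.lower()


print(normalize_code("NOUN") == normalize_code("noun"))  # True
```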

Root Element and Header

The root element of a morphological rule file is <TAN-mor>.

Zero or more <source>s describe the grammars or related works that account for the morphological rules. If the categories, codes, and rules are not based upon any published work, then <source> may be omitted. Any TAN-mor file without a source may be inferred to be based upon the personal knowledge of the persons or organizations identified in <file-resp>.

<vocabulary-key> is populated with the grammatical <feature>s that are allowed as grammatical concepts in the language; they are assigned codes via @xml:id. Because a grammatical feature is not allowed in a TAN-mor file until it is explicitly declared in a <feature>, @xml:id might simply repeat the value of @which.

TAN has a standard vocabulary file for grammatical features: vocabularies/features.TAN-voc.xml. This vocabulary file encodes 746 vocabulary items corresponding to core grammatical features declared in the OLiA Reference Model for Morphology, Morphosyntax and Syntax (http://purl.org/olia/olia.owl). See the section called “TAN keywords for features (<feature>)”.

If you wish to incorporate into your codes characters that are not allowed in @xml:id, e.g., $ or :, you should create an <alias>, whose @id allows such values. An <alias> can also be used to assign multiple grammatical features to a single id.

Data (`<body>`)

The <body> of a TAN-mor file takes the customary optional attributes found in other TAN files (see the section called “Edit Stamp”).

Within <body>, you begin with a language declaration: one or more <for-lang>s.

After the language declaration come rules: zero or more <where>s declaring rules to be followed for the feature codes. <where> has attributes that establish the context under which its enclosed rules are operative. Those rules are found in the enclosed <assert>s or <report>s, which declare rules that must be followed, or must never be followed, by any dependent TAN-A-lm file.

An <assert> or <report> is checked only if the conditions declared by the attributes in the enclosing <where> are met in the context of a given <m>:

@m-matches (regular expression): <m> matches the pattern.
@tok-matches (regular expression): one of the values of <tok> in the given <ana> matches the pattern.
@m-has-features (space-delimited strings): <m> has the specified features.
@m-has-how-many-features (integer): <m> has the given number of features.

An <assert> also takes one or more of the truth conditions above. If the test proves false for a given <m>, that <m> is marked as erroneous and the message enclosed by the <assert> is returned.

<report> has the same effect, but the test looks for the opposite boolean value: the error and message will be returned only if the test proves true.
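The interplay of <where>, <assert>, and <report> can be modeled roughly as follows. This Python sketch is not how TAN validation is implemented. It assumes that multiple attributes on one element must all hold, that regular expressions match anywhere in the string, and that the codes in <m> stand directly for features. The names `Conditions`, `Rule`, and `check` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Conditions:
    """The four test attributes; those left unset are not checked."""
    m_matches: Optional[str] = None
    tok_matches: Optional[str] = None
    m_has_features: Tuple[str, ...] = ()
    m_has_how_many_features: Optional[int] = None

    def holds(self, m: str, toks: List[str]) -> bool:
        features = m.lower().split()
        if self.m_matches is not None and not re.search(self.m_matches, m):
            return False
        if self.tok_matches is not None and not any(
            re.search(self.tok_matches, tok) for tok in toks
        ):
            return False
        if not all(f.lower() in features for f in self.m_has_features):
            return False
        if (self.m_has_how_many_features is not None
                and len(features) != self.m_has_how_many_features):
            return False
        return True


@dataclass
class Rule:
    kind: str            # "assert" or "report"
    test: Conditions
    message: str


def check(where: Conditions, rules: List[Rule], m: str, toks: List[str]) -> List[str]:
    """Return the messages triggered by one <m> and its sibling <tok> values."""
    if not where.holds(m, toks):
        return []  # context not met, so the enclosed rules are skipped
    messages = []
    for rule in rules:
        result = rule.test.holds(m, toks)
        # <assert> flags a false test; <report> flags a true one.
        if (rule.kind == "assert" and not result) or (rule.kind == "report" and result):
            messages.append(rule.message)
    return messages


# Invented rule: whenever the token is "the", report any analysis as a verb.
where = Conditions(tok_matches=r"^the$")
rules = [Rule("report", Conditions(m_has_features=("verb",)), "'the' is not a verb")]
print(check(where, rules, "verb", ["the"]))     # ["'the' is not a verb"]
print(check(where, rules, "article", ["the"]))  # []
```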

After the rules comes a structure declaration (if relying upon structured codes): zero or more <category>s. Each one sorts <feature>s into groups, assigning them @code values that are unique within the <category>. Sequence is important. The first <category> defines the features allowed in the first code position, the second in the second, and so forth.
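To illustrate how the sequence of <category>s governs interpretation, the following Python sketch resolves a positional code against a list of categories. The categories, codes, and feature names are invented for the example, and the use of a hyphen as an empty-position placeholder is an assumption.

```python
# Hypothetical categories in document order; each maps a code to a feature name.
CATEGORIES = [
    {"n": "noun", "v": "verb"},        # first position: part of speech
    {"s": "singular", "p": "plural"},  # second position: number
]


def decode(m: str) -> list:
    """Resolve space-delimited positional codes into feature names."""
    codes = m.lower().split()
    if len(codes) > len(CATEGORIES):
        raise ValueError("more codes than categories")
    features = []
    for position, (code, category) in enumerate(zip(codes, CATEGORIES), start=1):
        if code == "-":
            continue  # assumed placeholder: no value in this position
        if code not in category:
            raise ValueError(f"code {code!r} is not allowed in position {position}")
        features.append(category[code])
    return features


print(decode("n s"))  # ['noun', 'singular']
print(decode("V P"))  # ['verb', 'plural']
```

Here the same letter could be reused in different categories without ambiguity, because each code is looked up only in the category for its position.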

See sample TAN-mor files in the examples directory.