Morphological Concepts and Patterns (TAN-mor)

TAN-mor files describe the morphological features of a given language, assign codes to those features, and define rules governing how those codes may be applied. The format is designed to be specific, flexible, and responsive: assertions may be flagged as doubtful, rules may be made contingent upon other conditions, and warnings and error messages may be sent to users who have applied a pattern incorrectly or contrary to best practices.

The TAN-mor format is a kind of Schematron for the grammar of human languages. You specify the categories and codes for a given language, then you may create tests that define invalid uses of those codes. Those tests are attached to reports and assertions, which allow editors of TAN-A-lm files to see not only whether a rule has been violated, but why, and exactly where.

This chapter should be read in close conjunction with the section called “Lexico-Morphology”.

Principles and Assumptions

Certain assumptions and recommendations are made regarding morphology files, complementing the more general ones; see the section called “Design Principles”.

TAN-mor files are restricted exclusively to describing the categories and rules for the grammar of a natural language. Editors of these files should be well versed in the grammar of the languages they are describing.

The TAN-mor format has been designed with the assumption that patterns of word inflection and formation can be categorized, classified, named, and described. It has also been assumed that scholars may reasonably differ, perhaps radically, on how categories are defined and applied. TAN-mor is meant to allow those differences to be declared. It is up to other users to decide whether or not to adopt them.

The TAN-mor format has also been designed to cater to two different approaches to morphological codes: categorized or uncategorized.

Codes that are categorized are interpreted according not only to their value but also to their position. For example, the categorized codes adopted by Perseus for morphological analysis of Greek, Latin, and other highly inflected languages stipulate ten categories, with the first two being the major and minor parts of speech, and the subsequent categories devoted to person, number, tense, and so forth. Each word that is analyzed must have a value in every category, even if null, and the position of each code is significant.

Uncategorized codes simply give each grammatical feature a unique code, to be applied in any permitted sequence and combination. This approach is viable for any language (including highly inflected ones such as Greek or Latin), but it is most often found in tagging sets for languages that are not highly inflected, e.g., the Brown and Penn sets for English.
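The difference between the two approaches can be sketched in a few lines of Python. The three-slot scheme, the feature codes, and the two helper functions below are hypothetical, and deliberately much smaller than any real tagging set. The sketch also assumes that uncategorized codes are separated by spaces. It is meant only to show that a categorized code is read by position, whereas an uncategorized code is read as a collection of independent features.

```python
# Hypothetical three-category scheme (NOT the Perseus scheme): each position
# has its own code table, and "-" marks a null value.
CATEGORIES = [
    ("part of speech", {"n": "noun", "v": "verb"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"n": "nominative", "a": "accusative"}),
]

# Hypothetical uncategorized scheme: every feature has one unique code.
FEATURES = {"NOUN": "noun", "SG": "singular", "PL": "plural", "NOM": "nominative"}


def decode_categorized(code):
    """Interpret each character according to the position it occupies."""
    if len(code) != len(CATEGORIES):
        raise ValueError("a categorized code needs one value per category")
    result = {}
    for char, (name, table) in zip(code, CATEGORIES):
        result[name] = None if char == "-" else table[char]
    return result


def decode_uncategorized(code):
    """Interpret space-separated feature codes; position carries no meaning."""
    return {FEATURES[c] for c in code.split()}
```

In this sketch `decode_categorized("n-n")` reads the first `n` as "noun" and the last `n` as "nominative", with a null value for number, while `decode_uncategorized("NOUN SG")` and `decode_uncategorized("SG NOUN")` yield the same set of features.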

Root Element and Header

The root element of a morphological rule file is <TAN-mor>.

Zero or more <source> elements describe the grammars or related works that account for the rules declared in the TAN file. If the rules are not based upon any published work, then <source> may be omitted. Any TAN-mor file without a source is assumed to be based upon the personal knowledge of the <person>s who edited the file.

<definitions> is populated with the grammatical <feature>s that are considered operative. If a particular discipline customarily uses codes that are not allowed in @xml:id, you may wish to create an <alias>.

Data (`<body>`)

The <body> of a TAN-mor file takes the customary optional attributes found in other TAN files (see @in-progress and the section called “Edit Stamp”).

The children of <body> begin with one or more <for-lang>s, followed by any number of <where>s (containing <assert>s or <report>s) or <category>s (if relying upon categorized, i.e. structured, codes).

<category>, used for structured codes, sorts <feature>s into groups, assigning them @code values that are unique within the <category>.
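The uniqueness requirement applies within a single category, not across categories. As a rough illustration, the hypothetical Python helper below takes one category as a list of (code, feature) pairs and returns any code assigned more than once. The data representation is an assumption made for this sketch and is not part of the TAN-mor format.

```python
from collections import Counter


def duplicate_codes(category):
    """Return the codes assigned to more than one feature in one category."""
    counts = Counter(code for code, _feature in category)
    return sorted(code for code, n in counts.items() if n > 1)


# Hypothetical categories: "n" may be reused across categories...
part_of_speech = [("n", "noun"), ("v", "verb")]
case = [("n", "nominative"), ("a", "accusative")]
# ...but not twice within the same category.
broken_degree = [("s", "superlative"), ("c", "comparative"), ("s", "simple")]
```

Here `duplicate_codes(part_of_speech)` and `duplicate_codes(case)` both return an empty list, while `duplicate_codes(broken_degree)` returns `["s"]`.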

<assert>s and <report>s are used to declare rules that must be followed, or must never be followed, by any dependent TAN-A-lm file.

An <assert> or <report> is checked only if the conditions in the enclosing <where> are met in the context of a given <m> in a dependent TAN-A-lm file:

@m-matches: <m> matches the pattern (regular expression).
@tok-matches: one of the values of <tok> in the given <ana> matches the pattern (regular expression).
@m-has-features: <m> has the specified features.
@m-has-how-many-features: <m> has the given number of features.

An <assert> itself carries one or more of the tests listed above. If the test proves false for a given <m>, then the <m> is marked as erroneous and the message included in the <assert> is returned.

<report> works the same way, but with the test inverted: the error and message are returned only if the test proves true.
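The interplay of <where>, <assert>, and <report> can be modeled with a short Python sketch. It is not the TAN validation code. The class `Conditions` and the function `check` are hypothetical names, and the sketch makes several assumptions that the text above does not settle: that multiple conditions must all hold, that a pattern may match anywhere in the string, and that the features of an <m> are space-separated codes.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conditions:
    """The four tests; any left at its default is simply not checked."""
    m_matches: Optional[str] = None
    tok_matches: Optional[str] = None
    m_has_features: frozenset = frozenset()
    m_has_how_many_features: Optional[int] = None

    def hold(self, m, toks):
        """True if every supplied condition is met (assumed: logical AND).

        m is the text of an <m>, read as space-separated feature codes;
        toks are the <tok> values of the enclosing <ana>.
        """
        features = m.split()
        if self.m_matches is not None and not re.search(self.m_matches, m):
            return False
        if self.tok_matches is not None and not any(
            re.search(self.tok_matches, t) for t in toks
        ):
            return False
        if not self.m_has_features <= set(features):
            return False
        if (self.m_has_how_many_features is not None
                and len(features) != self.m_has_how_many_features):
            return False
        return True


def check(kind, where, test, message, m, toks):
    """Return the message if the rule flags this <m>, otherwise None."""
    if not where.hold(m, toks):
        return None                          # enclosing <where> does not apply
    result = test.hold(m, toks)
    if kind == "assert":
        return None if result else message   # error when the test is false
    if kind == "report":
        return message if result else None   # error when the test is true
    raise ValueError(f"unknown rule kind: {kind}")


# Hypothetical rules, both applying only to an <m> that has the feature NOUN.
nouns = Conditions(m_has_features=frozenset({"NOUN"}))
needs_three = Conditions(m_has_how_many_features=3)
has_tense = Conditions(m_matches=r"\bPAST\b")
```

With these definitions, `check("assert", nouns, needs_three, "Nouns need number and case.", "NOUN SG", ["domus"])` returns the message because the asserted test is false, whereas `check("report", nouns, has_tense, "Nouns take no tense.", "NOUN SG PAST", ["domus"])` returns its message because the reported test is true. Neither rule fires for an <m> such as `VERB SG`, since the enclosing condition is not met.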