Morphological Concepts and Patterns (TAN-mor)

TAN-mor files are used to describe the grammatical morphological features of a given language, to assign codes to those features, and to define rules governing the application of those codes. The format allows specificity, flexibility, and responsiveness. Assertions in the format may be doubted, rules may be expressed as contingent upon other conditions, and warnings and error messages may be sent to users who have used a pattern incorrectly, or not in accordance with best practices.

The TAN-mor format is like Schematron for the grammar of human languages. You specify the categories and codes for a given language, then you may create tests to define invalid uses of those codes. Those tests are attached to reports and assertions, allowing editors of TAN-LM files to see not only whether the rules have been violated, but why, and exactly where.

This chapter should be read in close conjunction with the section called “Lexico-Morphology”.

Certain assumptions and recommendations are made regarding morphology files, complementing the more general ones; see the section called “Design Principles”.

TAN-mor files are restricted exclusively to describing the categories and rules for the grammar of a natural language. Editors of these files should be well versed in the grammar of the languages they are describing.

The TAN-mor format has been designed with the assumption that patterns of word inflection and formation can be categorized, classified, named, and described. It has also been assumed that scholars may reasonably differ, perhaps radically, on those descriptions. TAN-mor is meant to allow those differences to be declared. It is up to other users to decide whether or not to adopt them.

The TAN-mor format has also been designed to cater to two different approaches to morphological codes: structured and unstructured.

Structured codes begin with a set of major categories used to group morphological features. Structured codes tend to have a set number of code elements, and usually require gaps in the code. For example, the Perseus approach to the morphological categories of Greek, Latin, and other highly inflected languages dictates ten categories, with the first two being the major and minor parts of speech, and the subsequent categories devoted to person, number, tense, and so forth. Each word that is analyzed must have a value for every category, even if that value is null.

Unstructured codes do not attempt to categorize grammatical features, but simply give each one a unique code, to be applied in any permitted sequence and combination. This approach is viable for any language (including highly inflected ones such as Greek or Latin), but it is most often found in tagging sets for languages that have little inflection, e.g., the Brown and Penn sets for English.
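The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration: the ten category names, the "-" null marker, the sample code "v-3spia---", and the assumption that unstructured codes are separated by whitespace are all hypothetical, chosen to mirror the description above rather than taken from any actual TAN-mor file.

```python
# Hypothetical category order, for illustration only; a real TAN-mor file
# defines its own categories and codes.
CATEGORIES = ["major_pos", "minor_pos", "person", "number", "tense",
              "mood", "voice", "gender", "case", "degree"]
NULL = "-"  # assumed placeholder for a category that has no value


def parse_structured(code):
    """Map each position of a fixed-length code to its category, skipping gaps."""
    if len(code) != len(CATEGORIES):
        raise ValueError(f"expected {len(CATEGORIES)} positions, got {len(code)}")
    return {cat: ch for cat, ch in zip(CATEGORIES, code) if ch != NULL}


def parse_unstructured(code):
    """Split a code into independent feature tags; position implies no category."""
    return code.split()


structured = parse_structured("v-3spia---")
unstructured = parse_unstructured("VBD")
```

In the structured case, position carries meaning, so gaps must be filled with a null value. In the unstructured case, each tag stands on its own.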

The root element of a morphological rule file is <TAN-mor>.

Zero or more <source> elements describe the grammars or related works that account for the rules declared in the TAN file. If the rules are not based upon any published work, then <source> may be omitted. Any TAN-mor file without a source will be assumed to be based upon the personal knowledge of the <agent>s who edited the file.

<declarations> is empty.

The <body> of a TAN-mor file takes the customary optional attributes found in other TAN files (see @in-progress and the section called “Edit Stamp”).

The children of <body> begin with one or more <for-lang>s, followed by any number of <assert>s, <report>s, <feature>s (for unstructured codes), or <category>s (if relying upon structured codes).

<category>, used for structured codes, sorts <feature>s into groups. @code values must be unique within a <category>, but may duplicate the @code values of <feature>s from other <category>s. The first <feature> in a <category> describes the category itself, and is not a <feature> like the others.
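The uniqueness rule for @code can be pictured as a per-category check. The following Python sketch assumes the codes have already been extracted into a plain dictionary keyed by a hypothetical category name. It does not read TAN-mor XML, and it is not how a TAN validator is actually implemented.

```python
from collections import Counter


def duplicate_codes(categories):
    """Return {category name: [codes used more than once]} for offending categories."""
    problems = {}
    for name, codes in categories.items():
        dupes = sorted(c for c, n in Counter(codes).items() if n > 1)
        if dupes:
            problems[name] = dupes
    return problems


# "p" appears in two different categories, which is permitted.
ok = {"number": ["s", "p", "d"], "tense": ["p", "i", "f"]}
# "p" appears twice within one category, which is not.
bad = {"number": ["s", "p", "p"]}
```

Here duplicate_codes(ok) finds nothing to flag, while duplicate_codes(bad) flags "p" in the "number" category.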

The values and combinations of <feature>s (or rather of the @codes of <feature>s) can be constrained through <assert>s and <report>s, which are used to declare rules that must be followed, or must never be followed, by any dependent TAN-LM file.

An <assert> or <report> may be restricted to specific features through @context. If @context is present, the declaration is checked in a TAN-LM file only against those <m>s that invoke the feature; otherwise, all <m>s are tested. Four kinds of tests are allowed:

  • @matches-m: indicates a regular expression pattern to be checked against the code in an <m>.

  • @matches-tok: indicates a regular expression pattern to be checked against the tokens picked by the values of <tok> in a dependent TAN-LM file.

  • @feature-test: indicates features to be checked in the content of <m>s.

  • @feature-qty-test: indicates the number of features to be checked in the content of <m>s.
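As a rough illustration of how @context narrows a rule and how the two regular-expression tests behave, consider the Python sketch below. It rests on several assumptions: @context is reduced to a single feature name, matching is unanchored (a substring match succeeds), and Python's re syntax stands in for whatever regular expression dialect a TAN processor uses. The function names and sample values are hypothetical. The syntax of @feature-test and @feature-qty-test is not described here, so they are left out.

```python
import re


def applies(context, m_features):
    """A rule with no context applies to every <m>; otherwise the <m> must invoke the feature."""
    return context is None or context in m_features


def matches(pattern, value):
    """Unanchored regex test, standing in for @matches-m or @matches-tok."""
    return re.search(pattern, value) is not None


m_features = ["verb", "past"]   # hypothetical features invoked by one <m>
m_code = "verb past"            # hypothetical code held by that <m>
token = "walked"                # hypothetical token picked by <tok>
```

A rule whose context is "noun" would skip this <m>, while a rule with no context, or with the context "verb", would go on to test it, for example with matches(r"ed$", token).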

An <assert> indicates that, for any <m> in any dependent TAN-LM file, if the test proves false and the <m> has a feature declared in @context, then the <m> should be marked as erroneous and the message included in the <assert> should be returned. If @cert is present, only a warning is returned instead of an error.

<report> has the same effect, but the role of the test is the opposite: the error and message will be returned only if the test proves true.
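The opposite polarity of the two elements can be summarized in a short Python sketch. The function name, its arguments, and the "error"/"warning" labels are hypothetical. The sketch only restates the logic described above and assumes the test result has already been computed for an <m> that falls within the rule's context.

```python
def evaluate(kind, test_result, message, cert=None):
    """Return None when the rule is satisfied, else a (severity, message) pair."""
    if kind == "assert":
        triggered = not test_result   # an assert complains when the test is false
    elif kind == "report":
        triggered = test_result       # a report complains when the test is true
    else:
        raise ValueError(f"unknown rule kind: {kind}")
    if not triggered:
        return None
    # Assumption: the mere presence of @cert downgrades the error to a warning.
    severity = "warning" if cert is not None else "error"
    return (severity, message)
```

For instance, evaluate("assert", False, "bad code") yields an error, whereas evaluate("report", False, "bad code") yields nothing, because a report fires only on a true test.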