Morphological Concepts and Patterns (TAN-mor)

TAN-mor files allow you to describe the morphological features of a given language, to assign codes to those features, and to define rules governing how those codes may be applied. The format allows specificity, flexibility, and responsiveness. Assertions in the format may be doubted, rules may be expressed as contingent upon other conditions, and warnings and error messages may be sent to users who have used a pattern incorrectly or not in accordance with best practices.

The TAN-mor format is like Schematron, but for the grammatical analysis of human languages. You specify the categories and codes that prevail in the analysis of a given language, then you may provide tests to define invalid uses of those codes. Those tests are attached to reports and assertions, allowing editors of TAN-LM files to see not only whether the rules have been violated, but why, and exactly where.

This chapter should be read in close conjunction with the chapter on TAN-LM files, which depend exclusively upon TAN-mor files (see the section called “Lexico-Morphology”).

Certain assumptions and recommendations are made regarding morphology files, complementing the more general ones; see the section called “Design Principles”.

TAN-mor files are restricted exclusively to describing the inflectional categories that characterize a language and to declaring rules that restrict how those categories can be declared or combined. Editors of these files should be familiar with the grammar of the languages they are describing.

The TAN-mor format has been designed under the assumption that word formation and inflection, in any given language, can be analyzed in different ways. It is also assumed that patterns of word inflection and formation can be categorized, classified, named, and described. Different views on the grammatical features and tendencies of a language should be declared, not suppressed. For example, not everyone agrees on the number of major parts of speech in English. And among those who think there are only eight, some name and define those categories in different ways. A mid-twentieth-century paradigm held to a major category called conjunctions, whereas most linguists now prefer to break this into two major categories, subordinators and coordinators.

The TAN-mor format has also been designed to cater to two approaches to coding the morphological features of a language: structured or unstructured.

Structured codes presume a set number of categories into which the various features of morphology are sorted. Such codes tend to have a fixed number of code elements, and usually require gaps in different positions. For example, the Perseus approach to the morphological categories of Greek, Latin, and other highly inflected languages dictates ten categories, with the first two being the major and minor parts of speech, and the subsequent categories devoted to person, number, tense, and so forth. Each word that is analyzed must have a value in every category, even if null.
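As a minimal sketch of how a structured code behaves, the following Python fragment decodes a fixed-width code position by position. The five category slots, the code letters, and the use of a hyphen for a null value are all invented for illustration; they are not the Perseus scheme, and nothing here is part of TAN itself.

```python
# Hypothetical structured scheme: every code has exactly one character per
# category, in a fixed order. "-" stands for a null value (an assumption).
CATEGORIES = [
    "major part of speech",
    "minor part of speech",
    "person",
    "number",
    "tense",
]

# Illustrative code tables, one per category position.
VALUES = [
    {"n": "noun", "v": "verb"},
    {"-": None, "p": "proper"},
    {"-": None, "1": "first", "2": "second", "3": "third"},
    {"-": None, "s": "singular", "p": "plural"},
    {"-": None, "p": "present", "a": "aorist"},
]


def decode_structured(code):
    """Map a fixed-width code to {category: value}, keeping null slots."""
    if len(code) != len(CATEGORIES):
        raise ValueError(f"expected {len(CATEGORIES)} characters, got {len(code)}")
    decoded = {}
    for name, table, char in zip(CATEGORIES, VALUES, code):
        if char not in table:
            raise ValueError(f"{char!r} is not a valid code for {name}")
        decoded[name] = table[char]
    return decoded


# A verb with no minor part of speech: the second slot is filled with a null.
print(decode_structured("v-3sp"))
```

The position of each character carries its meaning, which is why a structured code cannot simply omit a category that does not apply.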

Morphology based on unstructured codes relies upon a single master set of tags for morphological features, which can be applied in any sequence and combination (other rules permitting). This approach is viable for any language (including highly inflected ones such as Greek or Latin), but it is most often found in tagging sets for languages that have little inflection, e.g., the Brown and Penn sets for English.
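By way of contrast, an unstructured code can be modeled as a plain set of tags drawn from one master list. The sketch below is only illustrative: the tag names are hypothetical (they are not the Brown or Penn tags), and separating tags with spaces is an assumption made for the example.

```python
# Hypothetical master tag set; the tag names are illustrative only.
MASTER_TAGS = {"noun", "verb", "sg", "pl", "past", "3rd"}


def parse_unstructured(code):
    """Split a space-separated code into a set of tags, in any order."""
    tags = set(code.split())
    unknown = tags - MASTER_TAGS
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    return tags


# Order does not matter: both codes describe the same analysis.
print(parse_unstructured("verb 3rd sg") == parse_unstructured("sg verb 3rd"))
```

Because no slot is reserved for any category, features that do not apply are simply left out rather than padded with nulls.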

The root element of a morphological rule file is <TAN-mor>.

Zero or more <source> elements describe the grammars or related works that account for the rules declared in the TAN file. If the rules are not based upon any published work, then <source> may be omitted. Any TAN-mor file without a <source> will be assumed to be based upon the personal knowledge of the <agent>s who edited the file.

<declarations> is empty.

The body of the file takes two main types of element: one or more <for-lang>s, each of which takes a standard language code (see the section called “Languages”), and one or more <feature>s, which declare the morphological features that characterize the languages being described. Together these form a kind of morphological alphabet from which both structured and unstructured approaches begin.

The <body> of a TAN-mor file takes the customary optional attributes found in other TAN files (see @in-progress and the section called “Edit Stamp”).

The children of <body> begin with one or more <for-lang>s, followed by any number of <assert>s, <report>s, <feature>s, or <category>s (if relying upon structured codes).

<category> allows you to sort <feature>s into groups. This technique may be useful for languages that have numerous morphological features that are traditionally grouped to support complex codes. A common way to handle the morphology of Greek, for example, is through a ten-character code that indicates the major part of speech, person, number, and so forth. This requires users of the TAN-mor file to respect the order in which various codes appear. The first <feature> in a <category> describes the category itself, and is not a <feature> like the others.

This approach is optional, and it is probably not the best way to declare the morphological features of a language. By dispensing with <category>, you allow users to access the @xml:id value of <feature> directly, and to list features in any order they wish.

The values and combinations of <feature>s (or rather of the @codes of <feature>s) can be constrained through <assert>s and <report>s, which declare patterns that must always be followed, or must never be used, by any dependent TAN-LM file.

An <assert> or <report> may be restricted to specific features through @context. If @context is present, then the <assert> or <report> will be checked in a TAN-LM file only against values of <m> that invoke the feature; otherwise, all <m>s will be tested. Four kinds of tests are allowed:

  • @matches-m: indicates a regular expression pattern to be checked against the code in an <m>.

  • @matches-tok: indicates a regular expression pattern to be checked against the tokens picked by the values of <tok> in a dependent TAN-LM file.

  • @feature-test: indicates features to be checked in the content of <m>s.

  • @feature-qty-test: indicates the number of features to be checked in the content of <m>s.
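The two pattern-based tests in the list above can be approximated with ordinary regular-expression searches. The following Python sketch is a rough model only: TAN's regular expressions follow XML conventions, which differ from Python's `re` module in some details, and whether a pattern must match the whole value or any part of it is not stated here, so the unanchored search is an assumption. The sample code and token are hypothetical.

```python
import re


def matches_m(pattern, m_code):
    """Model of @matches-m: test a pattern against the code in an <m>."""
    return re.search(pattern, m_code) is not None


def matches_tok(pattern, token):
    """Model of @matches-tok: test a pattern against a token picked by <tok>."""
    return re.search(pattern, token) is not None


# Hypothetical data: a positional code for a plural noun and the token it analyzes.
print(matches_m(r"^n..p", "n--p-"))
print(matches_tok(r"s$", "houses"))
```

The two functions are identical in mechanism and differ only in what they inspect: the first looks at the morphological code, the second at the word being analyzed.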

An <assert> indicates that, for any <m> in any dependent TAN-LM file, if the test proves false (and, where @context is present, the <m> has a feature declared there), then the <m> should be marked as erroneous and the message included by the <assert> should be returned. If @cert is present, only a warning is returned instead of an error.

<report> has the same effect, but the role of the test is the opposite: the error and message will be returned only if the test proves true.
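The opposite polarity of <assert> and <report>, together with the effects of @context and @cert, can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the TAN validation routine: the `Rule` class, the `check` function, the feature names, and the sample codes are hypothetical, a single regular expression stands in for any of the four tests, and the @cert value of 0.5 is arbitrary.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    kind: str                      # "assert" or "report"
    pattern: str                   # stands in for a @matches-m test
    message: str
    context: Optional[str] = None  # feature the rule is restricted to, if any
    cert: Optional[float] = None   # if present, an error becomes a warning


def check(rule, m_code, m_features):
    """Return (severity, message) if the rule fires on this <m>, else None."""
    # With @context, only <m>s that invoke the feature are tested.
    if rule.context is not None and rule.context not in m_features:
        return None
    test_is_true = re.search(rule.pattern, m_code) is not None
    # An assert fires when its test is false; a report fires when it is true.
    fires = (not test_is_true) if rule.kind == "assert" else test_is_true
    if not fires:
        return None
    severity = "warning" if rule.cert is not None else "error"
    return (severity, rule.message)


# Hypothetical rules over a five-slot positional code (fifth slot = tense).
verb_needs_tense = Rule("assert", r"^v...[^-]", "Verbs must have a tense.",
                        context="verb")
no_tense_on_nouns = Rule("report", r"^n...[^-]", "Nouns rarely take tense.",
                         context="noun", cert=0.5)

print(check(verb_needs_tense, "v-3s-", {"verb", "third", "singular"}))
print(check(no_tense_on_nouns, "n--sp", {"noun", "singular", "present"}))
print(check(verb_needs_tense, "n--s-", {"noun", "singular"}))
```

In this model the first call yields an error (the assertion's test is false), the second yields a warning (the report's test is true and a certainty value is present), and the third yields nothing because the <m> does not invoke the feature named in the context.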