Morphological Concepts and Patterns (TAN-mor)


TAN-mor files are used to delineate the morphological characteristics or features of a given language, to assign codes to those features, and to define rules governing the application of those codes. The format is a kind of schema language for the grammar of human languages.

The format allows specificity, flexibility, and responsiveness. Grammatical rules may be constructed to return warnings and error messages to users who use a code or pattern incorrectly, or not in accordance with best practices. Such rules may be qualified, or made contingent upon certain conditions.

This chapter should be read in close conjunction with the section called “Lexico-morphology (<TAN-A-lm>)”.

Principles and Assumptions

Certain assumptions and recommendations are made regarding morphology files, complementing the more general ones; see the section called “Design principles”.

TAN-mor files are restricted exclusively to describing the categories and rules for the grammar of a natural language. Editors of these files should be well versed in the grammar of the languages they are describing, and generally acquainted with how the grammars of comparable languages work.

The TAN-mor format has been designed under the assumption that patterns of word inflection and formation can be categorized, classified, named, and described. It has also been assumed that scholars may reasonably differ, perhaps radically, on how grammatical features should be defined and applied. TAN-mor allows scholars to declare clearly their operative assumptions and views. It is up to other users to decide whether or not to adopt them.

The TAN-mor format has also been designed to cater to two different approaches to morphological codes: categorized or uncategorized.

Categorized codes are interpreted according to position. The sequence a b c would mean something different from c b a. For example, Perseus (http://www.perseus.tufts.edu/hopper/) has traditionally used categorized codes for the morphological analysis of Greek, Latin, and other highly inflected languages. Every code has ten positions, each one corresponding to a major grammatical category, with the first two being the major and minor parts of speech, and the subsequent positions devoted to person, number, tense, and so forth. Every position must be given a value for each analyzed word, even if only a hyphen or null. A d in one position means something different from a d in another.

Uncategorized codes, on the other hand, assign one unique code to each grammatical feature. In this approach, codes may be combined and arranged at will. The sequence a b c would be equivalent to c b a. This approach is viable for any language (including highly inflected ones such as Greek or Latin), but in practice it is most often applied to languages that are not highly inflected, e.g., the Brown and Penn tag sets for English.
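
The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration, not part of TAN: the function names are hypothetical, and it assumes that the parts of a code are separated by spaces.

```python
# Illustrative only: TAN does not define these helper names.

def categorized_key(code: str) -> tuple[str, ...]:
    """Positional interpretation: the order of the parts is significant."""
    return tuple(code.split())


def uncategorized_key(code: str) -> frozenset[str]:
    """Unordered interpretation: only the set of parts matters."""
    return frozenset(code.split())


# Under the categorized approach, reordering changes the meaning.
print(categorized_key("a b c") == categorized_key("c b a"))      # False
# Under the uncategorized approach, reordering changes nothing.
print(uncategorized_key("a b c") == uncategorized_key("c b a"))  # True
```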

TAN-mor morphological codes may not include either the space or the hyphen, and unlike IDrefs, they are case insensitive. For example, the codes NOUN and noun are interchangeable.
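
A minimal Python sketch of these two constraints follows. The function name is hypothetical, and the use of case folding to compare codes is an assumption about how case insensitivity might be implemented, not a description of any TAN validator.

```python
def normalize_code(code: str) -> str:
    """Return a case-folded form of a code, rejecting spaces and hyphens."""
    if " " in code or "-" in code:
        raise ValueError(f"code may not contain a space or hyphen: {code!r}")
    # Assumption: comparing case-folded strings models case insensitivity.
    return code.casefold()


# NOUN and noun normalize to the same value, so they are interchangeable.
print(normalize_code("NOUN") == normalize_code("noun"))  # True
```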

Root Element and Header

The root element of a morphological rule file is <TAN-mor>.

Zero or more <source>s refer to the grammars or related works that account for the morphological rules. If the categories, codes, and rules are not based upon any published work, then <source> may be omitted. Any TAN-mor file without a source may be inferred to be based upon the personal knowledge of the persons or organizations identified in <file-resp>.

The header declares the language or languages covered, by means of one or more <for-lang>s.

Data (`<body>`)

The <body> of a TAN-mor file takes the customary optional attributes found in other TAN files (see the section called “Edit stamp”).

<body> contains interleaved rules and grammatical codes, either categorized or not.

Grammatical rules consist of a series of <rule>s, perhaps filtered by attribute tests, and perhaps filtered by child <where>s with attribute tests. These tests are evaluated against the context of the various <m>s in a dependent TAN-A-lm file.

Attribute tests are as follows:

@m-matches (regular expression): <m> matches the pattern.
@tok-matches (regular expression): one of the values of <tok> in the given <ana> matches the pattern.
@m-has-codes (space-delimited strings): <m> has the specified feature codes.
@m-has-how-many-codes (integer): <m> has the given number of feature codes.

If all the attributes in a <rule> or any of its child <where>s evaluate true against a context, then the process allows the actual rules to be evaluated. Those rules are found in the enclosed <assert>s or <report>s, which declare rules that must be followed, or must never be followed, by any dependent TAN-A-lm file.

An <assert> or <report> will be checked only if the conditions declared by the attributes in the enclosing <rule> or <where> are met.

An <assert> also takes one or more of the attribute tests above. If the test proves false for a given <m>, then the <m> will be marked as erroneous and the message contained in the <assert> will be returned.

<report> has the same effect, but the test looks for the opposite boolean value: the error and message will be returned only if the test proves true.
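
The opposite polarity of the two elements can be sketched in Python. The function below is hypothetical and simplified: it handles only a single pattern test of the @m-matches kind, uses Python's regular expression flavor, and assumes the enclosing conditions have already been met.

```python
import re
from typing import Optional


def check(kind: str, pattern: str, m: str, message: str) -> Optional[str]:
    """Return the message if the check flags <m> as erroneous, else None."""
    test_is_true = re.search(pattern, m) is not None
    if kind == "assert":
        # An assert flags an error when its test proves false.
        return None if test_is_true else message
    if kind == "report":
        # A report flags an error when its test proves true.
        return message if test_is_true else None
    raise ValueError(f"unknown kind: {kind!r}")


msg = "An interrogative must be either a determiner (d) or a pronoun (p)."
print(check("assert", r"^[dp]", "d i", msg))  # None
print(check("assert", r"^[dp]", "n i", msg))  # prints the message
```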

Mixed with the rules are codes, either categorized or not.

If categorized, there are zero or more <category>s. Each one sorts <code>s into groups, assigning them <val>s that are unique within the <category>. Sequence is important. The first <category> defines the features allowed in the first code position, the second <category> those allowed in the second, and so forth.
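
The positional lookup can be illustrated with a short Python sketch. The two-category scheme below is invented for the example, as is the function name; the sketch assumes that positions are space-delimited and that a hyphen marks an empty position.

```python
from typing import List, Optional

# Hypothetical scheme: one dict per <category>, mapping each <val> to a feature.
CATEGORIES = [
    {"n": "noun", "v": "verb"},                   # first code position
    {"1": "first", "2": "second", "3": "third"},  # second code position
]


def decode(code: str) -> List[Optional[str]]:
    """Resolve each part of a categorized code against its position."""
    parts = code.split()
    if len(parts) > len(CATEGORIES):
        raise ValueError("code has more positions than there are categories")
    features: List[Optional[str]] = []
    for position, (part, category) in enumerate(zip(parts, CATEGORIES), start=1):
        if part == "-":  # assumption: a hyphen marks an empty position
            features.append(None)
        elif part in category:
            features.append(category[part])
        else:
            raise ValueError(f"{part!r} is not allowed in position {position}")
    return features


print(decode("v 3"))  # ['verb', 'third']
print(decode("n -"))  # ['noun', None]
```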

If not categorized, then there are simply one or more <code>s. Each <code> has a @feature that points to one or more vocabulary items for a grammatical feature, either by IDref or by name.

TAN has a standard vocabulary file for grammatical features: vocabularies/features.TAN-voc.xml. This vocabulary file encodes 746 grammatical features declared in the OLiA Reference Model for Morphology, Morphosyntax and Syntax (http://purl.org/olia/olia.owl). See the section called “TAN keywords for features (<feature>)”.

Each <code> must have a <val>, which contains the actual code used, and it may take one or more <desc>s, to explain how the grammatical features should be interpreted for a given language. A <desc> is the ideal place to provide examples.

In addition to the examples below, see the sample TAN-mor files in the examples directory.

Example 7.1. Examples of rules and codes

<rule m-has-how-many-codes="2-10">
   <report m-matches="^c">A conjunction has no other inflectional
      properties.</report>
   <report m-matches="^r">A preposition has no other inflectional
      properties.</report>
   <report m-matches="^i">An interjection has no other inflectional
      properties.</report>
   <report m-matches="^y">An acronym has no other inflectional properties.</report>
</rule>
. . . . . . .
<rule m-matches="^. i">
   <assert m-matches="^[dp]">An interrogative must be either a determiner (d) or a
      pronoun (p).</assert>
</rule>
. . . . . .
<code feature="accusative"><val>accusative</val></code>
<code feature="nominative"><val>nominative</val></code>
<code feature="case_dative"><val>dative</val></code>
<code feature="case_genitive"><val>genitive</val></code>
<code feature="case_vocative"><val>vocative</val></code>
. . . . . .
<category feature="feature_person">
   <code feature="first"><val>1</val></code>
   <code feature="second"><val>2</val></code>
   <code feature="third"><val>3</val></code>
</category>