Chapter 6. Class-2 TAN Files, Annotations of Texts


This chapter provides general background to the elements and attributes that are common to class 2 TAN files. For detailed discussion see Chapter 8, TAN patterns, elements, and attributes defined.

At present, class 2 files are restricted to alignment or lexico-morphology.

Alignment files come in two different formats, identified by the root element. TAN-A-div provides macroscopic alignment; TAN-A-tok, microscopic. TAN-A-div aligns one or more class 1 files. It is intended for broad, general alignments of any number of versions of any number of works. The scope of TAN-A-tok is more restricted, to two class 1 files, allowing one to declare alignments with detailed specificity, certainty, and type between words (tokens). TAN-A-div focuses on works, regardless of version; TAN-A-tok focuses on individual versions.

Lexico-morphology files (also called part-of-speech files), TAN-LM, are used to encode the lexical headwords and morphological forms of individual words in class 1 files.

Common Elements

The class 2 formats have been designed to be human readable, particularly in their references to class 1 files. In ordinary conversation, when referring to specific parts of a work, we like to cite pages, paragraphs, sentences, lines, words, letters, and so forth. We use relational words (e.g., "first") and the very text itself. We might say, for example, "See page 4, second paragraph, the last four words." Or, "See page 4, second paragraph, first sentence, second occurrence of 'pull'."

The TAN pointer syntax differs from other pointer systems (e.g., URLs, XPath, and XPointer) in that it depends upon a hierarchy of four features: works, divisions, word tokens, and characters. Works, defined above (see the section called “One work”), are defined by the source (which may not have more than one work). Divisions are defined by the <div> structure of each source. Tokens are words of those divisions, defined according to one or more tokenization rules. And characters are defined as non-modifying codepoints in a word token. (A modifying character is treated as one with the non-modifying base character it modifies.)
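
The character-counting rule can be illustrated with a minimal Python sketch. It assumes that "modifying character" can be approximated by Unicode combining marks, as reported by the standard library's unicodedata.combining; the function name base_characters is hypothetical, and the TAN definition of modifying characters may be broader or narrower than this.

```python
import unicodedata


def base_characters(token: str) -> list[str]:
    """Group each combining mark with the base character it modifies.

    Assumption: a "modifying character" is any codepoint with a nonzero
    canonical combining class. TAN itself may define the set differently.
    """
    groups: list[str] = []
    for ch in token:
        if unicodedata.combining(ch) and groups:
            groups[-1] += ch  # attach the mark to the preceding base character
        else:
            groups.append(ch)  # a new, independently countable character
    return groups


# "e" followed by U+0301 (combining acute accent) counts as one character.
word = "re\u0301sume\u0301"
chars = base_characters(word)
print(len(word), len(chars))  # 8 6
```

Under this assumption, a pointer to character 2 of the token would pick out the accented e as a whole, not the bare letter or the accent alone.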

Parts of this fourfold hierarchy—works, divisions, tokens, and characters—are named with vocabulary that the editor of a class 2 file finds most useful. Sources are given a nickname (e.g., xml:id = "hamlet-1741"); divisions are named using the values for @n; tokens are referred to by position, by their actual values, or both (e.g., pos = "1 - 5", pos = "last-1 - last", val = "hath"; see the section called “@pos and @val”). Characters are always identified by number (e.g., chars = "2, 7").

This approach not only makes the syntax human readable, it also mitigates any disruptions that corrections or alterations might incur. For example, if an incorrectly duplicated <div> is deleted, disruption to the reference system is isolated and does not affect the rest of the document.

Class 2 Validation

Some class 2 files may be time-consuming to validate fully. The length of the <body> could be enormous. Or the number and length of sources may be taxing. Or validation may depend upon time-consuming transformations of the source documents. Most often this problem affects TAN-A-div files. To facilitate editing within an XML editor, where regular validation is essential, Schematron validation is divided into two phases:

basic: All regular Schematron tests are suspended, and reports are devoted exclusively to assisting in looking for and checking the validity of references in <div-ref> and <tok>.
verbose: complete testing of class-2 files, including checks on source files to determine whether they adhere to the LDUR (see the section called “Flattened References, and the Leaf Div Uniqueness Rule”). In addition, information is given on where there are discrepancies in the numeration system across versions of the same work.

If you do not specify in the prolog which phase you intend to be the default, you will be prompted for the phase you wish to use whenever you validate the file.

Class 2 Metadata (`<head>`)

Class 2 files share a few common features in their metadata, mostly to facilitate the human-friendly reference system outlined above.

All class 2 files have as their sources nothing other than class 1 files. Therefore each <source> must follow the pattern described in the section called “Digital Entity Metadata Pattern”. Because the rights have already been declared in the source files, <rights-source-only> is disallowed.

Editors of class 2 files must be able to name or number word-tokens in a transcription, via an optional <token-definition>. See the section called “Defining Words and Tokens”.

There may be some cases where a source has a div type that is unnecessary, is confusing, or should be ignored. One or more optional <suppress-div-types>s may be used to specify division types that you wish to suppress in references.

Optional <rename-div-ns> elements provide a convenient way to provisionally rename @n values. This is useful for cases where you wish to use division labels that are more familiar to users of the class 2 files, or are easier to edit and read. It can also be used to harmonize discordant @n values, which is especially helpful for divs that are named, not numbered, such as the books of the Bible.

Class 2 Data Patterns (`<body>`)

The three types of class 2 files treat different kinds of phenomena, so their data structures look quite different. Nevertheless, a few elements and attributes are shared by at least two class 2 formats.

Many class 2 elements take @src and @ref. @src points via ID reference to one or more <source>s, and @ref points to one or more <div>s through their flat refs (with @n values replaced by their new values if <rename-div-ns> has been invoked; see the section called “Metadata (<head>)”).

In the example ref = "1.2-4, 1.5", the periods are arbitrary (but the hyphen and comma, which have special meanings here, are not). You may use any punctuation you wish, or even space, but it is recommended you use what will be most familiar to users. You may use non-Arabic numerals, regardless of the numbering system used by your sources.
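
The following Python sketch illustrates the point about punctuation. It is not the TAN implementation: the function name parse_ref is hypothetical, and the sketch assumes only what is stated above, namely that commas separate items, a hyphen joins the two ends of a range, and any other punctuation or space merely separates the parts of a reference. It does not expand ranges or decide how a shortened range end such as the "4" in "1.2-4" relates to its start.

```python
import re
from typing import Optional


def parse_ref(ref: str) -> list[tuple[list[str], Optional[list[str]]]]:
    """Split a @ref value into (start, end) pairs of reference parts.

    `end` is None for a single reference. Ranges are left unexpanded.
    """
    items = []
    for part in ref.split(","):  # commas separate items
        ends = [re.findall(r"\w+", end) for end in part.split("-")]  # hyphen = range
        if len(ends) > 2 or not all(ends):
            raise ValueError(f"cannot interpret {part.strip()!r}")
        items.append((ends[0], ends[1] if len(ends) == 2 else None))
    return items


print(parse_ref("1.2-4, 1.5"))
# [(['1', '2'], ['4']), (['1', '5'], None)]
print(parse_ref("1 2-4, 1:5") == parse_ref("1.2-4, 1.5"))  # True
```

Because the separators inside each reference are discarded, the two spellings compared in the last line yield the same structure.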

@chars and @pos follow a useful compact syntax, described below (the section called “@pos and @val”).

`@pos` and `@val`

To point to a token, one of three methods may be used.

@pos alone. Under this method, one or more digits, or the phrase last or last- plus a digit, joined by hyphens or commas indicate one or more token numbers. For example, 2, 4-6, last-2 - last refers to the second, fourth, fifth, sixth, antepenult, penult, and final tokens in a sequence of word tokens. The numerical value to which the keyword last resolves depends upon the context of each source and ref.
@val alone. Under this method, a single token is picked by means of a string value equivalent to the token. For example, val = "bird" points to the first occurrence of the token bird.
@pos and @val together. Under this method, specific occurrences of a token are picked. For example, val = "bird" pos = "2, 4" picks the second and fourth occurrences of the token bird.
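
As a rough illustration of how a @pos value resolves to token numbers, here is a minimal Python sketch. The function name resolve_pos and the decision to raise an error on out-of-range or backward ranges are assumptions made for the example; the sketch implements only the syntax described in the list above.

```python
import re

# One position: a number, "last", or "last-N".
ITEM = r"(?:\d+|last(?:-\d+)?)"
# One comma-separated part: a single position or a "start-end" range.
RANGE = re.compile(rf"({ITEM})(?:-({ITEM}))?")


def resolve_pos(pos: str, token_count: int) -> list[int]:
    """Resolve a @pos expression to 1-based token numbers."""

    def to_number(item: str) -> int:
        if item.startswith("last"):
            offset = int(item[5:]) if len(item) > 4 else 0
            number = token_count - offset
        else:
            number = int(item)
        if not 1 <= number <= token_count:
            raise ValueError(f"{item!r} is out of range for {token_count} tokens")
        return number

    positions: list[int] = []
    for part in re.sub(r"\s+", "", pos).split(","):  # whitespace is insignificant
        match = RANGE.fullmatch(part)
        if match is None:
            raise ValueError(f"cannot interpret {part!r}")
        start = to_number(match.group(1))
        end = to_number(match.group(2)) if match.group(2) else start
        if start > end:
            raise ValueError(f"range {part!r} runs backward")
        positions.extend(range(start, end + 1))
    return positions


# In a division of ten tokens, "last" resolves to 10.
print(resolve_pos("2, 4-6, last-2 - last", 10))  # [2, 4, 5, 6, 8, 9, 10]
```

The same expression applied to a division with a different number of tokens yields different numbers for the last three items, which is why last must be resolved separately for each source and ref.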

Any time @pos appears in an element and @val does not, @val is assumed to allow matches to any word. Conversely, if @val appears but @pos does not, the latter is assumed to equal 1.

@pos and @val must be used carefully. For example, the attribute combination val="bird" pos="last-5" will produce an error if the word token bird does not occur at least six times.
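
The defaults and the error case can be sketched in Python as follows. This is an illustration under stated assumptions, not the TAN validation code: the function name pick_tokens is hypothetical, @val is compared as an exact string, and for brevity only comma-separated single positions (a number, last, or last-N) are handled, not ranges.

```python
from typing import Optional


def pick_tokens(tokens: list[str],
                val: Optional[str] = None,
                pos: Optional[str] = None) -> list[int]:
    """Return the 1-based positions in `tokens` selected by @val and @pos."""
    # Without @val, every token is a candidate.
    candidates = [i for i, tok in enumerate(tokens, start=1)
                  if val is None or tok == val]
    picked = []
    # Without @pos, the first candidate is taken.
    for item in (pos or "1").replace(" ", "").split(","):
        if item.startswith("last"):
            n = len(candidates) - int(item[5:] or 0)
        else:
            n = int(item)
        if not 1 <= n <= len(candidates):
            raise ValueError(
                f"pos {item!r} is out of range: "
                f"only {len(candidates)} matching token(s)")
        picked.append(candidates[n - 1])
    return picked


tokens = "the bird saw a bird near the other bird".split()
print(pick_tokens(tokens, val="bird"))                 # [2]
print(pick_tokens(tokens, val="bird", pos="2, last"))  # [5, 9]
print(pick_tokens(tokens, pos="last-1"))               # [8]
```

With these nine tokens, pick_tokens(tokens, val="bird", pos="last-5") raises ValueError, because bird occurs only three times.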
