Chapter 6. Class-2 TAN Files, Annotations of Texts

Table of Contents

Common Elements
Class 2 Metadata (<head>)
Class 2 Data Patterns (<body>)
@pos and @val
Division-Based Annotations and Alignments (<TAN-A-div>)
Root Element and Header
Data (<body>)
Token-Based Annotations and Alignments (<TAN-A-tok>)
Root Element and Header
Data (<body>)
Principles and Assumptions
Root Element and Header
Data (<body>)

This chapter provides general background to class 2 TAN files. For detailed discussion of individual elements and attributes see Chapter 8, TAN patterns, elements, and attributes defined.

TAN-A-div files provide broad, macroscopic alignment of multiple versions of any number of works. They also provide a place for annotating the texts through general claims.

TAN-A-tok files provide narrow, microscopic alignment of any two class 1 files, identifying word-for-word or character-for-character correspondence.

TAN-A-lm files support lexico-morphology (part-of-speech) for either a single class 1 file or a language.

In translation studies, it is common to use the term source (or sources) to refer to a translated text and the term target to refer to the translation. TAN, however, has been designed for cases where it may not be clear which is the target and which is the source. Further, there is a more generic use of source that takes precedence. In these guidelines, therefore, we avoid the term target altogether, and when we use the word source, we are referring only to one of the class 1 files upon which a class 2 alignment depends.

The class 2 formats have been designed to be human readable, particularly in their references to class 1 files. In ordinary conversation, when referring to specific parts of a work, we like to cite pages, paragraphs, sentences, lines, words, letters, and so forth. We use relational words (e.g., "first"), and the very text itself. We might say, for example, "See page 4, second paragraph, the last four words." Or, "See page 4, second paragraph, first sentence, second occurrence of 'pull'."

Those familiar conventions are the basis for the TAN pointer syntax, and so it differs from other pointer systems (e.g., URLs, XPath, and XPointer). TAN pointers depend upon a fourfold hierarchy of: works, divisions, word tokens, and characters. Works, defined above (see the section called “One work”), are defined by the source (which may not have more than one work). Divisions are defined by the <div> structure of each source. Tokens are words of those divisions, defined according to one or more tokenization rules. And characters are defined as non-modifying codepoints in a word token. (A modifying character is always included with the base character it modifies.)
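The rule that a modifying character travels with its base character can be illustrated with a short Python sketch. It is not part of TAN: the function name tan_characters is hypothetical, and "modifying" is approximated here by the Unicode general categories that begin with M (marks), which may not match the definition the TAN validation routines actually use.

```python
import unicodedata


def tan_characters(token: str) -> list[str]:
    """Split a token into base characters, each keeping its modifying marks.

    Assumption: a "modifying character" is any codepoint whose Unicode
    general category starts with "M" (Mn, Mc, Me).
    """
    chars: list[str] = []
    for cp in token:
        if chars and unicodedata.category(cp).startswith("M"):
            chars[-1] += cp  # attach the mark to the preceding base character
        else:
            chars.append(cp)
    return chars


# "e" followed by U+0301 (combining acute accent) counts as one character.
print(tan_characters("cafe\u0301s"))
```

Under this reading, chars = "4" applied to the token above would pick the accented letter as a whole, never the accent alone.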

Parts of this fourfold hierarchy—works, divisions, tokens, and characters—normally have familiar names. Sources can be given a meaningful abbreviated name (e.g., xml:id = "hamlet-1741"); divisions are named according to @n; tokens are referred to by position, by their actual values, or both (e.g., pos = "1 - 5", pos = "last-1 - last", val = "hath"; see the section called “@pos and @val”). Characters are always identified by number (e.g., chars = "2, 7").

This approach not only makes the syntax human readable, it also mitigates disruptions from corrections to the dependencies. For example, if an incorrectly duplicated <div> is deleted, disruption to the reference system is isolated and does not affect the rest of the document.

Class 2 files share a few common features in their metadata, mostly to facilitate the human-friendly reference system outlined above.

All class 2 files have as their sources nothing other than class 1 files. Therefore each <source> must follow the pattern described in the section called “Digital Entity Metadata Pattern”.

Editors of class 2 files must be able to name or number word-tokens in a transcription, via an optional <token-definition>. See the section called “Defining Words and Tokens”.

Inevitably, some class 1 sources will have differences. Perhaps works or div types were not defined with the same IRIs, or perhaps one version follows an idiosyncratic reference system. If sources need to be reconciled, alterations are specified in <alter>, which stipulates a set of actions that should be applied to the sources that have been named. Alteration actions include:

These actions allow you to reconcile sources that are somewhat at odds. Actions are applied first hierarchically and then in the sequence stated above. That is, the validation routine will go level by level through a given source. Any rules that are found in one level will be applied (skips taking top precedence, reassigns the lowest) before moving to the next level of the source. So if you wish in a given source to change chapter 1 to chapter 2, any subdivisions will be collated. If you wanted to do further things with (original) 1.5, you would need to refer to it as 2.5, and you would also need to realize that if original 2.5 exists, the action will be applied to both.

Each action adds time to the validation routines. On lengthy texts these can become quite time-consuming. You are advised to keep <alter>s to a minimum. If a source requires numerous alterations, you may find it less time-consuming to create a new version of the source.

The three types of class 2 files treat different kinds of phenomena, so their data structures look quite different. Nevertheless, a few elements and attributes are shared by at least two class 2 formats.

Many class 2 elements take @src and @ref. @src points via ID reference to one or more <source>s, and @ref points to one or more <div>s through their flat ref, perhaps substituted with their new values if <alter>s have been invoked (see the section called “Metadata (<head>)”).

In the example ref = "1.2-4, 1.5", the periods are arbitrary (but the hyphen and comma, which have special meanings here, are not). You may use any separating punctuation or space you wish, except for hyphens and commas, which are reserved to create ranges and joins. You may also use other numeral systems.
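The reserved roles of the comma and the hyphen can be sketched in Python as follows. This is an illustration only, not the TAN validation routine: parse_ref is a hypothetical name, and the sketch deliberately stops at splitting the value into its parts. How a shortened range end such as the "4" in "1.2-4" is completed is left to the actual routines.

```python
import re


def parse_ref(ref: str) -> list[list[list[str]]]:
    """Split a @ref value into joins (commas), range ends (hyphens), and steps."""
    items = []
    for item in ref.split(","):  # commas join separate references
        ends = []
        for end in item.split("-"):  # a hyphen separates the two ends of a range
            # Any other punctuation or whitespace merely separates hierarchy steps.
            steps = [s for s in re.split(r"[^\w]+", end.strip()) if s]
            ends.append(steps)
        items.append(ends)
    return items


print(parse_ref("1.2-4, 1.5"))  # [[['1', '2'], ['4']], [['1', '5']]]
print(parse_ref("1 2-4, 1:5") == parse_ref("1.2-4, 1.5"))  # True
```

The second call shows why the periods are arbitrary: a space or a colon between steps yields the same structure.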

To point to a token, one of three methods may be used.

  1. @pos alone. Under this method, one or more digits, or the phrase last or last- plus a digit, joined by hyphens or commas indicate one or more token numbers. For example, 2, 4-6, last-2 - last refers to the second, fourth, fifth, sixth, antepenult, penult, and final tokens in a passage. The numerical value to which the keyword last resolves depends upon the length of each <div>.

  2. @val alone. Under this method, a single token is picked by means of a string value equivalent to the token. For example, @val = "bird" points to the first occurrence of the token bird.

  3. @pos and @val together. Under this method, specific occurrences of a token are picked. For example, @val="bird" @pos="2, 4" picks the second and fourth occurrences of the token bird.

During validation, a missing @pos or @val is supplied with its default value, 1 or .+ respectively. That is, @pos by default points to the first instance and @val by default points to any string.

@pos and @val must be used carefully. For example, the attribute combination val="bird" pos="last-5" will produce an error if the word token bird does not occur at least six times.
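The interplay of @pos, @val, and their defaults can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the TAN implementation: the names resolve_pos and select_tokens are hypothetical, @val is assumed to be a regular expression that must match the whole token (suggested by the default .+), and @pos is assumed to count among the tokens that match @val.

```python
import re

# One @pos term is a number, "last", or "last-N".
TERM = r"(?:last(?:-\d+)?|\d+)"
# One @pos item is a single term or a range of two terms joined by a hyphen.
ITEM = re.compile(rf"\s*({TERM})(?:\s*-\s*({TERM}))?\s*")


def resolve_pos(pos: str, count: int) -> list[int]:
    """Expand a @pos value into 1-based positions among `count` candidates."""

    def term_to_int(term: str) -> int:
        if term.startswith("last"):
            n = count - int(term[5:] or 0)  # "last" -> count, "last-2" -> count - 2
        else:
            n = int(term)
        if not 1 <= n <= count:
            raise ValueError(f"{term!r} is out of range: only {count} candidates")
        return n

    positions: list[int] = []
    for item in pos.split(","):  # commas join items
        m = ITEM.fullmatch(item)
        if m is None:
            raise ValueError(f"cannot parse @pos item {item!r}")
        start = term_to_int(m.group(1))
        end = term_to_int(m.group(2)) if m.group(2) else start
        positions.extend(range(start, end + 1))
    return positions


def select_tokens(tokens: list[str], pos: str = "1", val: str = ".+") -> list[int]:
    """Return the 1-based token numbers picked by @pos and @val."""
    # Token numbers whose text matches @val in full (an assumption, see above).
    matches = [i for i, t in enumerate(tokens, start=1) if re.fullmatch(val, t)]
    # @pos then counts among those matches, not among all tokens.
    return [matches[p - 1] for p in resolve_pos(pos, len(matches))]


tokens = "the bird saw a bird and the bird sang".split()
print(select_tokens(tokens, pos="2, 4-6, last-2 - last"))  # [2, 4, 5, 6, 7, 8, 9]
print(select_tokens(tokens, val="bird"))                   # [2]
print(select_tokens(tokens, val="bird", pos="2, 3"))       # [5, 8]
```

In this sketch, select_tokens(tokens, val="bird", pos="last-5") raises a ValueError, because bird occurs only three times in the sample and last-5 needs at least six occurrences.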

It is advisable to use @val, and not merely @pos. If an editor makes corrections to your source texts, references that lack @val are more likely to become corrupt and less likely to be traceable.