Format Organization


The Text Alignment Network is a modular suite of XML encoding formats, each one designed for a specific type of textual data, divided into three classes: texts (class 1), text alignments and annotations (class 2), and everything else (class 3).

Class 1, representations of textual objects, consists solely of transcription files. (See note on transcriptions versus transliterations.) Each transcription file contains the text of a single work from a single text-bearing object (which we term scriptum; see the section called “Domain model”), whether physical or digital. There are two types of transcription file: a standard generic format (TAN-T) and a customization of TEI All (TAN-TEI). These two types are differentiated by the root element, <TAN-T> and <TEI> respectively.

Class 2 files encode claims about class-1 texts and align them. There are two types of alignment, one for broad, general alignments and another for granular, word-for-word alignments. The former, with <TAN-A> as the root element, aligns any number (one or more) of class-1 files and allows a wide variety of claims about those files. The latter, <TAN-A-tok>, aligns only pairs of class-1 files. A third class-2 format, the lexico-morphology file (<TAN-A-lm>), encodes the lexical and morphological (or part-of-speech) forms of individual words from a single class-1 file, or of a language in general.

Class 3 covers everything else. <TAN-mor> is used to define the grammatical categories or features of a given language and to specify rules for lexico-morphological codes in dependent TAN-A-lm files. <TAN-voc> collects and labels vocabulary items used in other TAN files. TAN catalog files, which have the root element <collection>, index locally available TAN files and selected parts of their metadata.
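Because each format is identified by its root element, a file's class can be read off the root element name. The following Python sketch illustrates the idea. The function names and the lookup table are illustrative assumptions, not part of TAN, and the sketch ignores namespaces, which a real check would also need to verify (a generic name such as <collection> is not distinctive on its own).

```python
import xml.etree.ElementTree as ET

# Root element names as listed in the text above, mapped to their class.
TAN_CLASS_BY_ROOT = {
    "TAN-T": 1, "TEI": 1,
    "TAN-A": 2, "TAN-A-tok": 2, "TAN-A-lm": 2,
    "TAN-mor": 3, "TAN-voc": 3, "collection": 3,
}


def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def tan_class(xml_text):
    """Return 1, 2, or 3 for a recognized root element name, else None.

    Simplification: the namespace is discarded, not checked.
    """
    root = ET.fromstring(xml_text)
    return TAN_CLASS_BY_ROOT.get(local_name(root.tag))
```

For instance, tan_class('<TAN-A-tok/>') yields 2, and an unrecognized root such as <html/> yields None.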

This modular design relies upon stand-off annotation (or markup). In the alternative method, inline markup, an annotation is inserted directly into a transcription, e.g., <p>He said <quote>"Jump!"</quote></p>, where the inner element <quote> annotates the third word. Most TEI and HTML files rely upon inline annotation. In stand-off annotation, <p>He said "Jump!"</p> would be left as-is, and somewhere else there would be an annotation that states that the third word is a quotation. If the stand-off annotation is in the same file, it is an internal stand-off annotation. If the annotation is in a different file, it is an external stand-off annotation.
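The difference can be made concrete with a minimal Python sketch. The annotation record used here is hypothetical and is not TAN syntax; it only shows the principle that the transcription stays untouched while a separate record points to the third word. Tokenization is naive whitespace splitting, which is an assumption made for the sake of the example.

```python
import xml.etree.ElementTree as ET

# The transcription is left as-is, exactly as in the example above.
transcription = '<p>He said "Jump!"</p>'

# A stand-off annotation kept apart from the text. This record layout is
# hypothetical (not TAN syntax): it points at a token by 1-based position.
annotation = {"token": 3, "claim": "quotation"}


def resolve(transcription_xml, note):
    """Return the token that a stand-off annotation points to."""
    text = ET.fromstring(transcription_xml).text or ""
    tokens = text.split()  # naive whitespace tokenization (an assumption)
    return tokens[note["token"] - 1]


# The resolved token keeps its quotation marks, since they are part of the text.
print(resolve(transcription, annotation))  # prints "Jump!"
```

If the annotation record lived in the same file as the transcription, this would be internal stand-off annotation; stored in a separate file, it would be external.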

TAN depends upon external stand-off annotation, which provides several benefits:

An editor can focus on a limited set of closely related questions.
A source text without inline annotations is less cluttered, and therefore easier to read, than one with inline annotations.
Editors can work on separate annotation files based upon the same master transcription file, even if they have very different research interests.
Complementary or competing annotations can be made, even in the same file, and those annotations may point to concurrent or overlapping spans of text (a major problem for inline annotation, where according to XML rules no element may interlock or overlap with another).
A corpus of external stand-off annotation files becomes, collectively, a complex dataset, supporting lines of research that might not have been anticipated by any single project.
Editorial labor can be conducted without central coordination, as individuals work at their own pace, independently.
When an error is found in a transcription file, it can be corrected in a single place, the master. Anyone using a copy of that master file will be notified during validation of the changes that have been made and can deal with them accordingly.
Any data file can be updated independent of any other that points to it, or to which it points.
The cross-file links required in stand-off annotation join files into a network, and the networked files can then be combined and transformed in any number of ways to produce a wide variety of derivative documents (e.g., collated versions, statistical analyses).
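The point about overlapping spans in the list above can be illustrated with a short Python sketch. The annotation records are hypothetical (not TAN syntax) and refer to tokens by 1-based position. Two spans that overlap without one containing the other cannot both be expressed as inline XML elements, but as stand-off records they coexist without difficulty.

```python
tokens = ["He", "said", '"Jump!"']

# Two hypothetical stand-off annotations, each a (first, last) 1-based,
# inclusive token range. Neither modifies the token list.
annotations = [
    {"span": (1, 2), "claim": "editor A's span"},
    {"span": (2, 3), "claim": "editor B's span"},
]


def overlaps_without_nesting(a, b):
    """True if two spans share tokens but neither contains the other."""
    (a1, a2), (b1, b2) = a, b
    intersect = a1 <= b2 and b1 <= a2
    nested = (a1 <= b1 and b2 <= a2) or (b1 <= a1 and a2 <= b2)
    return intersect and not nested


def covered(toks, span):
    """Return the tokens that a span points to."""
    first, last = span
    return toks[first - 1:last]


for note in annotations:
    print(note["claim"], "->", covered(tokens, note["span"]))

# These two spans interlock, so they could not both be inline elements.
print(overlaps_without_nesting(annotations[0]["span"], annotations[1]["span"]))
```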

The stand-off approach works toward a principle often valued in computer science, that of the disaggregation of data. That is, in a master format, data should be simple and not entangled with other data. It can later be reaggregated in all kinds of ways, but that is an end product, not the way data should be managed. It is analogous to the way any well-run kitchen keeps its ingredients separate, until it is time to cook or bake a variety of products, at which time a few disaggregated ingredients can be combined in a variety of ways.

Stand-off annotation is not without problems and vulnerabilities. Files might be altered or altogether deleted, rendering pointers in dependent files meaningless. An editor may find that not having the annotated text in the same place as the annotation is an inconvenience. These are important challenges, but TAN validation rules have been designed to mitigate such problems.