Format Organization


The Text Alignment Network is a modular suite of XML encoding formats, each one designed for a specific type of textual data. The formats fall into three classes: transcriptions (class 1), annotations and alignments of transcriptions (class 2), and everything else (class 3).

Class 1, representations of textual objects, consists solely of transcription files. Each transcription file contains the text of a single work from a single text-bearing object (which we term scriptum), whether physical or digital. There are two types of transcription file: a standard generic format and a TEI extension. These two types are differentiated by the root element, <TAN-T> and <TEI> respectively.

Class 2, annotations of class 1 files, is used to encode claims about texts and to align them. There are two types of alignment: one for broad, general alignments and another for granular, word-for-word alignments. The former, with <TAN-A-div> as the root element, aligns any number (one or more) of class 1 files and permits assorted claims about those files. The latter, <TAN-A-tok>, aligns only pairs of class 1 files. A third class 2 format, the lexico-morphology file (<TAN-A-lm>), encodes the lexical and morphological (or part-of-speech) forms of individual words in a single class 1 file.

Class 3 covers everything else. <TAN-mor> is used to define the grammatical categories or features of a given language and to specify rules for tagging words in a dependent TAN-A-lm file. <TAN-key> collects and defines terms frequently used in other TAN files. <collection> marks TAN catalog files, which provide an index of locally available TAN files.
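Because each format is identified by its root element, the class of a file can be read off that element alone. The following is a minimal sketch in Python, not part of TAN itself: the names TAN_CLASS_BY_ROOT, local_name, and tan_class are hypothetical, and the lookup table simply restates the root elements listed above. It compares local names only, so it makes no assumption about namespaces, and it cannot tell a TAN-conformant TEI file from any other TEI file.

```python
import xml.etree.ElementTree as ET

# Root element local names mapped to the class described in the text above.
# (Hypothetical helper table; it only restates the prose.)
TAN_CLASS_BY_ROOT = {
    "TAN-T": 1,
    "TEI": 1,
    "TAN-A-div": 2,
    "TAN-A-tok": 2,
    "TAN-A-lm": 2,
    "TAN-mor": 3,
    "TAN-key": 3,
    "collection": 3,
}


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree adds to qualified names."""
    return tag.rsplit("}", 1)[-1]


def tan_class(xml_text):
    """Return 1, 2, or 3 for a recognized root element, otherwise None."""
    root = ET.fromstring(xml_text)
    return TAN_CLASS_BY_ROOT.get(local_name(root.tag))
```

A tool built this way could, for example, route a file to class-specific processing before any deeper validation is attempted.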

This modular approach supports what is sometimes called stand-off annotation (or stand-off markup), in contrast to in-line annotation, in which a text and its annotations are placed in a single file. (Most TEI and HTML files feature in-line annotation.) In stand-off annotation, the annotations reside in files separate from the text. This provides several benefits:

An editor can work on a file with minimal distraction, focusing on a limited set of closely related questions.
Editors can work off the same master files, even if they have very different research interests.
Complementary or competing annotations can be made, even if those annotations overlap (a major problem for in-line annotation, where according to XML rules no element may interlock or overlap with another).
TAN files become, collectively, a complex dataset, supporting lines of research that might not have been anticipated by any single project.
Editorial labor can be conducted without central coordination, as individuals work at their own pace, independently, on separate files.
When errors are found, they can be corrected in master files. Anyone depending upon that master file as a source will be notified of changes that have been made and they can deal with them accordingly. (Editor 1 can post typographical corrections, and if she logs the change with a time-date stamp, anyone using the file, upon validating their files, will be sent information or a warning about the change. Similarly, Editors 2 and 4 can let Editor 1 know about their work, and Editor 1 can update the Old French versions with cross-references.)
Any data file can be released, circulated, and used independent of any other that points to it, or to which it points.
Connected files can be combined and transformed in any number of ways to produce a wide variety of derivative documents (e.g., collated versions, statistical analysis). A transformation created for one set of TAN documents will work identically on other TAN documents of the same format. (If someone creates a tool to synthesize a transcription and an associated TAN-A-lm file, it can be applied to both Editor 2's and Editor 4's work.)
The TAN family of formats can be expanded to allow other types of linguistic data, and therefore other lines of research.
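The point about overlapping annotations in the list above can be illustrated with a generic sketch. It is not TAN's own referencing mechanism: the character offsets, the Annotation class, the overlaps function, and the sample claims are all assumptions made for illustration. Because each annotation lives apart from the text and merely points into it, two annotations may interlock without either one having to nest inside the other, which in-line XML markup would not allow.

```python
from dataclasses import dataclass

# The "master" text, kept separate from every annotation.
TEXT = "In the beginning was the word"


@dataclass(frozen=True)
class Annotation:
    """A hypothetical stand-off annotation pointing at a span of TEXT."""
    editor: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    claim: str

    def target(self, text):
        """Return the stretch of text this annotation refers to."""
        return text[self.start:self.end]


def overlaps(a, b):
    """True when the two annotated spans share at least one character."""
    return a.start < b.end and b.start < a.end


# Two editors annotate interlocking spans of the same text.
ann_a = Annotation("editor-1", 0, 16, "illustrative claim A")
ann_b = Annotation("editor-2", 7, 20, "illustrative claim B")
```

Here ann_a targets "In the beginning" and ann_b targets "beginning was"; the spans interlock, yet both annotations coexist because neither is embedded in the text.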

Stand-off annotation is not without liabilities. Files might be altered or deleted altogether, rendering dependent files meaningless. An editor may also find it inconvenient not to have the annotated text in the same place as the annotation. These are significant challenges, but TAN validation rules have been designed to mitigate them as much as possible.