Format Organization

The Text Alignment Network is a modular suite of XML encoding formats, each designed for a specific type of textual data. The formats fall into three classes: transcriptions (class 1), annotations and alignments of transcriptions (class 2), and everything else (class 3).

Class 1, representations of textual objects, consists solely of transcription files. Each transcription file contains the text of a single work from a single text-bearing object (which we term scriptum), whether physical or digital. There are two types of transcription file: a standard generic format and a TEI extension. These two types are differentiated by the root element, <TAN-T> and <TEI> respectively.

Class 2 files, annotations of class 1 files, are used to encode claims about texts and to align them. There are two types of alignment: one for broad, general alignments and another for granular, word-for-word alignments. The former, with <TAN-A-div> as the root element, aligns any number (one or more) of class 1 files and permits assorted claims about those files. The latter, <TAN-A-tok>, aligns only pairs of class 1 files. Lexico-morphology files, <TAN-A-lm>, encode the lexical and morphological (or part-of-speech) forms of individual words in a single class 1 file.

Class 3 covers everything else. <TAN-mor> is used to define the grammatical categories or features of a given language and to specify rules for tagging words in a dependent TAN-A-lm file. <TAN-key> collects and defines terms frequently used in other TAN files. <collection> marks TAN catalog files, which provide an index of locally available TAN files.
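Because each format is identified by its root element, the class of a file can be read off that element alone. The following is a minimal sketch, not part of TAN itself: the names ROOT_TO_CLASS, local_name, and tan_class are hypothetical, and the sketch compares only the local name of the root element, ignoring its namespace.

```python
import xml.etree.ElementTree as ET

# Root element local names mapped to their class, as listed in the text.
ROOT_TO_CLASS = {
    "TAN-T": 1,
    "TEI": 1,
    "TAN-A-div": 2,
    "TAN-A-tok": 2,
    "TAN-A-lm": 2,
    "TAN-mor": 3,
    "TAN-key": 3,
    "collection": 3,
}


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree adds to qualified tags."""
    return tag.rsplit("}", 1)[-1]


def tan_class(xml_text):
    """Return 1, 2, or 3 for a recognized root element, otherwise None."""
    root = ET.fromstring(xml_text)
    return ROOT_TO_CLASS.get(local_name(root.tag))
```

Since a name such as collection is generic, a real check would presumably also compare the namespace of the root element. That step is left out here because this section does not state the namespaces.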

This modular approach supports what is sometimes called stand-off annotation (or stand-off markup), in contrast to in-line annotation, in which a text and its annotations are placed in a single file. (Most TEI and HTML files feature in-line annotation.) In stand-off annotation, the annotations reside in files separate from the text. This provides several benefits:

Stand-off annotation is not without liabilities. Files might be altered or deleted altogether, rendering dependent files meaningless. An editor may find it inconvenient not to have the annotated text in the same place as the annotation. These are significant challenges, but TAN validation rules have been designed to mitigate them as much as possible.
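The first liability can be illustrated with a simplified sketch. It does not reproduce TAN's actual validation rules. The data model (a mapping from division references to text, and annotations that carry a ref key) and the function name dangling_refs are assumptions made only for this illustration.

```python
def dangling_refs(text_divs, annotations):
    """Return the annotations whose 'ref' matches no division in the text."""
    return [a for a in annotations if a["ref"] not in text_divs]


# Hypothetical transcription: division reference -> text of that division.
transcription = {"1.1": "In the beginning", "1.2": "was the word"}

# Hypothetical stand-off annotations, stored apart from the transcription.
annotations = [
    {"ref": "1.1", "note": "opening formula"},
    {"ref": "1.3", "note": "points at a division that no longer exists"},
]

# Only the second annotation is reported, since division 1.3 is missing.
broken = dangling_refs(transcription, annotations)
```

A check of this kind finds annotations that an edit to the source text has orphaned, so that they can be repaired before they are used.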