The Text Alignment Network is a modular suite of XML encoding formats, each one designed for a specific type of textual data, divided into three classes: texts (class 1), text alignments and annotations (class 2), and everything else (class 3).
Class 1, representations of textual objects,
consists solely of transcription files. (See note on transcriptions versus
transliterations.) Each transcription file contains the text of a single
work from a single text-bearing object (which we term scriptum;
see the section called “Domain model”), whether physical or digital. There are two
types of transcription file: a standard generic format (TAN-T) and a customization of
TEI All (TAN-TEI). These two types are differentiated by the root element,
<TAN-T> and <TEI> respectively.
Class 2 files encode claims about class-1 texts and
align them. There are two types of alignment, one for broad, general alignments and
another for granular, word-for-word alignments. The former, with <TAN-A>
as the root element, aligns any number (one or more) of class-1 files, and allows a
wide variety of claims about those files. The latter, <TAN-A-tok>, aligns only pairs
of class-1 files.
Lexico-morphology files, <TAN-A-lm>, are used to encode the lexical and morphological (or
part-of-speech) forms of individual words from a single class-1 file, or of a
language in general.
Class 3 covers everything else. <TAN-mor> is used to define the
grammatical categories or features of a given language and to specify rules for
lexico-morphological codes in dependent TAN-A-lm files. <TAN-voc> collects and labels
vocabulary items used in other TAN files. TAN catalog files, which have the root element
<collection>, index locally available TAN files and selected parts of their metadata.
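The mapping from root element to class described above can be sketched in a few lines of Python. This is only an illustration, not part of TAN itself: the function names `local_name` and `tan_class` are hypothetical, and the sketch compares local element names only, ignoring namespaces. A matching root element does not by itself show that a file is valid TAN (a `<TEI>` root, for example, is shared by every TEI file, not just TAN-TEI).

```python
import xml.etree.ElementTree as ET

# Root-element names and their classes, as listed in the text above.
TAN_CLASS_BY_ROOT = {
    "TAN-T": 1,
    "TEI": 1,
    "TAN-A": 2,
    "TAN-A-tok": 2,
    "TAN-A-lm": 2,
    "TAN-mor": 3,
    "TAN-voc": 3,
    "collection": 3,
}


def local_name(tag):
    """Strip ElementTree's '{namespace}' prefix from a tag, if present."""
    return tag.rsplit("}", 1)[-1]


def tan_class(xml_text):
    """Return 1, 2 or 3 for a recognized root element, otherwise None."""
    root = ET.fromstring(xml_text)
    return TAN_CLASS_BY_ROOT.get(local_name(root.tag))
```

For instance, `tan_class("<TAN-A-lm/>")` yields 2, while an unrelated root such as `<html/>` yields `None`.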
This modular design relies upon a stand-off
approach to annotation or markup. In the alternative method, inline markup, an annotation is inserted directly into a
transcription, e.g., <p>He said <quote>"Jump!"</quote></p>,
where the inner element <quote> annotates the third word. Most TEI
and HTML files rely upon inline annotation. In stand-off annotation, <p>He said "Jump!"</p>
would be left as-is, and somewhere else there would be
an annotation that states that the third word is a quotation. If the stand-off
annotation is in the same file, it is an internal stand-off
annotation. If the annotation is in a different file, it is an external
stand-off annotation.
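The difference between the two methods can be made concrete with a small Python sketch that takes the inline example above and separates it into a plain text and a stand-off annotation. The function name `inline_to_standoff` and the dictionary layout are assumptions made for illustration; they are not TAN syntax. The sketch also assumes that words are delimited by whitespace and that markup never begins or ends in the middle of a word.

```python
import xml.etree.ElementTree as ET

INLINE = '<p>He said <quote>"Jump!"</quote></p>'


def inline_to_standoff(xml_text):
    """Split inline markup into plain text plus stand-off annotations.

    Tokens are whitespace-delimited words, numbered from 1. Each annotation
    records an element name and the first and last token that it covers.
    """
    root = ET.fromstring(xml_text)
    tokens = []
    annotations = []

    def walk(element, is_root):
        start = len(tokens) + 1
        tokens.extend((element.text or "").split())
        for child in element:
            walk(child, False)
            # Text after a child's closing tag belongs to the parent.
            tokens.extend((child.tail or "").split())
        # Skip the wrapper element and any element that covers no tokens.
        if not is_root and len(tokens) >= start:
            annotations.append(
                {"type": element.tag, "from": start, "to": len(tokens)}
            )

    walk(root, True)
    return " ".join(tokens), annotations


text, annotations = inline_to_standoff(INLINE)
print(text)         # He said "Jump!"
print(annotations)  # [{'type': 'quote', 'from': 3, 'to': 3}]
```

If `annotations` is stored alongside the text, the result is internal stand-off annotation; if it is written to a separate file that points back to the transcription, it is external.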
TAN depends upon external stand-off annotation, which provides several benefits:
An editor can focus on a limited set of closely related questions.
A source text without inline annotations is less cluttered, and therefore easier to read, than one with inline annotations.
Editors can work on separate annotation files based upon the same master transcription file, even if they have very different research interests.
Complementary or competing annotations can be made, even in the same file, and those annotations may point to concurrent or overlapping spans of text (a major problem for inline annotation, where according to XML rules no element may interlock or overlap with another).
A corpus of stand-off external annotation files becomes, collectively, a complex dataset, supporting lines of research that might not have been anticipated by any single project.
Editorial labor can be conducted without central coordination, as individuals work at their own pace, independently.
When an error is found in a transcription file, it can be corrected in a single place, in the master. Anyone using a copy of that master file will be notified in the validation process of changes that have been made, and can deal with them accordingly.
Any data file can be updated independently of any other that points to it, or to which it points.
The cross-file links that stand-off annotation requires bind files into a network, and the networked files can then be combined and transformed in any number of ways to produce a wide variety of derivative documents (e.g., collated versions, statistical analyses).
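The point about overlapping spans in the list above can be illustrated with a short Python sketch. Everything in it is hypothetical: the sample sentence, the two editors' annotation sets, and the helper names `span_text` and `interlock` are invented for the example and are not drawn from TAN. Two independently made annotations cover token ranges that interlock, something that nested XML elements cannot express but that stand-off pointers handle without difficulty.

```python
# Token numbers are 1-based positions in a whitespace-tokenized transcription.
TEXT = "He said Jump and then he left"
TOKENS = TEXT.split()

# Two hypothetical annotation sets, as if kept in separate files by two editors.
EDITOR_A = [{"label": "clause-1", "from": 1, "to": 4}]
EDITOR_B = [{"label": "emphasis", "from": 3, "to": 6}]


def span_text(annotation):
    """Return the words that an annotation points to."""
    return " ".join(TOKENS[annotation["from"] - 1 : annotation["to"]])


def interlock(a, b):
    """True if two spans overlap without either one containing the other."""
    overlap = a["from"] <= b["to"] and b["from"] <= a["to"]
    a_in_b = b["from"] <= a["from"] and a["to"] <= b["to"]
    b_in_a = a["from"] <= b["from"] and b["to"] <= a["to"]
    return overlap and not (a_in_b or b_in_a)


print(span_text(EDITOR_A[0]))                # He said Jump and
print(span_text(EDITOR_B[0]))                # Jump and then he
print(interlock(EDITOR_A[0], EDITOR_B[0]))   # True
```

Neither editor needs to know about the other's file; both simply point at the same transcription.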
The stand-off approach works toward a principle often valued in computer science, that of the disaggregation of data. That is, in a master format, data should be simple and not entangled with other data. It can later be reaggregated in all kinds of ways, but that is an end product, not the way data should be managed. It is analogous to the way any well-run kitchen keeps its ingredients separate, until it is time to cook or bake a variety of products, at which time a few disaggregated ingredients can be combined in a variety of ways.
Stand-off annotation is not without problems and vulnerabilities. Files might be altered or altogether deleted, rendering pointers in dependent files meaningless. An editor may find that not having the annotated text in the same place as the annotation is an inconvenience. These are important challenges, but TAN validation rules have been designed to mitigate such problems.
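The kind of vulnerability described here, and one way a validator might detect it, can be sketched in Python. This is not a description of TAN's actual validation rules: the `source_digest` field, the annotation layout, and the function names `digest` and `validate` are assumptions made purely for illustration. The sketch stores a digest of the source text with the annotations, then reports both a changed source and any pointer that no longer resolves.

```python
import hashlib


def digest(text):
    """Return a SHA-256 hex digest of a source text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def validate(source_text, annotation_file):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if digest(source_text) != annotation_file["source_digest"]:
        problems.append("source has changed since the annotations were made")
    token_count = len(source_text.split())
    for ann in annotation_file["annotations"]:
        # A pointer resolves only if its token range lies inside the text.
        if not 1 <= ann["from"] <= ann["to"] <= token_count:
            problems.append(
                f"{ann['label']}: tokens {ann['from']}-{ann['to']} do not resolve"
            )
    return problems


original = 'He said "Jump!"'
annotation_file = {
    "source_digest": digest(original),
    "annotations": [{"label": "quote", "from": 3, "to": 3}],
}

print(validate(original, annotation_file))   # []
print(len(validate("He said", annotation_file)))  # 2
```

A check of this sort cannot repair a broken pointer, but it alerts the editor of the dependent file that something upstream has changed.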