Format organization


The Text Alignment Network is a modular suite of XML encoding formats, each one designed for a specific type of textual data, divided into three classes: texts (class 1), text alignments and annotations (class 2), and everything else (class 3).

Class 1: representations of textual objects, i.e., transcriptions. (See note on transcriptions versus transliterations.) Each transcription file contains the text of a single work from a single text-bearing object (which we term scriptum; see the section called “Domain model”), whether physical or digital. There are two types of transcription file: a standard generic format (TAN-T) and a gentle customization of TEI All (TAN-TEI). These two types are differentiated by the root element, <TAN-T> and <TEI> respectively.

Class 2: annotations on class-1 texts, and alignment declarations. There are two types of alignment, one for broad, general alignments and another for granular, word-for-word alignments. The former, with <TAN-A> as the root element, aligns any number (one or more) of class-1 files, and allows one to annotate those files. The latter, <TAN-A-tok>, aligns only pairs of class-1 files, on a word-for-word basis. Lexico-morphology files, <TAN-A-lm>, are used to encode the lexical and morphological (or part-of-speech) forms of individual words from a single class-1 file, or of a language in general.

Class 3: everything else. <TAN-voc> collects and labels vocabulary items used in other TAN files. TAN catalog files, which have the root element <collection>, index locally available TAN files and selected parts of their metadata. <TAN-mor> is used to define the grammatical categories or features of a given language and to specify rules for lexico-morphological codes in dependent TAN-A-lm files.
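
As a rough illustration (not part of TAN itself), the root elements named above can be used to sort a file into its class. The Python sketch below rests on some assumptions: the helper names local_name and tan_class are hypothetical, namespaces are stripped rather than checked, and a <TEI> or <collection> root is taken at face value even though such root elements also occur outside TAN.

```python
import xml.etree.ElementTree as ET

# Root element name -> TAN class, as listed in the text above.
TAN_CLASS_BY_ROOT = {
    "TAN-T": 1, "TEI": 1,
    "TAN-A": 2, "TAN-A-tok": 2, "TAN-A-lm": 2,
    "TAN-voc": 3, "collection": 3, "TAN-mor": 3,
}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to namespaced tags."""
    return tag.rsplit("}", 1)[-1]


def tan_class(xml_text):
    """Return 1, 2, or 3 for a recognized root element, otherwise None."""
    root = ET.fromstring(xml_text)
    return TAN_CLASS_BY_ROOT.get(local_name(root.tag))


print(tan_class("<TAN-T/>"))                                    # 1
print(tan_class('<TEI xmlns="http://www.tei-c.org/ns/1.0"/>'))  # 1
print(tan_class("<TAN-A-tok/>"))                                # 2
print(tan_class("<html/>"))                                     # None
```

A real check would also verify the namespace and validate the file against the appropriate schema; this sketch only reads the root element name.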

TAN adopts a stand-off approach to annotation or markup. In the alternative method, inline markup, which you may be familiar with from TEI or HTML, an annotation is applied directly to the base text, e.g., <p>He said <quote>"Jump!"</quote></p>, where the inner element <quote> annotates the third word.

In stand-off annotation, however, <p>He said "Jump!"</p> would be left as-is, and somewhere else there would be an annotation that states that the third word is a quotation. If the stand-off annotation is in the same file, it is an internal stand-off annotation. If the annotation is in a different file, it is an external stand-off annotation.
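
The difference can be sketched in a few lines of Python. This is only an illustration of the general idea, not TAN's pointer syntax: the whitespace tokenizer, the 1-based token positions, and the dictionary layout of the annotation are all assumptions made for the example.

```python
# The base text is stored on its own, with no markup inside it.
base_text = 'He said "Jump!"'

# A deliberately naive tokenizer: split on whitespace.
tokens = base_text.split()

# The annotation lives apart from the text and points at it by position
# (1-based, so "token 3" means the third word).
annotations = [{"token": 3, "type": "quote"}]


def resolve(tokens, annotation):
    """Return the token that a stand-off annotation points to."""
    return tokens[annotation["token"] - 1]


print(resolve(tokens, annotations[0]))  # "Jump!"
print(base_text)                        # He said "Jump!"
```

If the annotations list were saved in the same file as the base text, this would correspond to internal stand-off annotation; saved in a separate file, it would be external.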

For many common, simple cases, inline annotation is convenient and straightforward. But as inline annotations accumulate, those benefits diminish. When parts of a file attract multiple markup elements, the file can become difficult to read and navigate.

Stand-off annotation provides several benefits:

An editor can focus on a limited set of closely related questions.
A source text without inline annotations is less cluttered, and therefore easier to read, than one with inline annotations.
Editors can work on separate annotation files based upon the same master transcription file, even if they have very different research interests.
A single annotation can refer to two or more texts (e.g., identification of quotations), and need not prioritize, or be located in, any single one.
Complementary or competing annotations can be made, and those annotations may point to concurrent or overlapping spans of text (a major problem for inline annotation, where according to XML rules no element may interlock or overlap with another).
A corpus of external stand-off annotation files becomes, collectively, a complex dataset, supporting lines of research that might not have been anticipated by any single project.
Editorial labor can be conducted without central coordination, as individuals work at their own pace, independently.
When an error is found in a transcription file, it can be corrected in a single place, in the master. Anyone using a copy of that master file will be notified in the validation process of changes that have been made and they can deal with them accordingly.
Any data file can be updated independently of any other that points to it, or to which it points.
The cross-file links required in stand-off annotation network the files together, so that they can be combined and transformed in any number of ways to produce a wide variety of derivative documents (e.g., collated versions, statistical analysis).
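
Two of the benefits listed above, overlapping spans and later recombination, can be illustrated with a short Python sketch. Everything in it is hypothetical: the sample sentence, the two annotation layers (imagined as coming from two separate files), the inclusive 1-based token spans, and the helper names. It does not reproduce TAN syntax.

```python
tokens = "The sun set early. We walked home slowly.".split()  # 8 tokens

# Two independent layers of annotation over the same base text.
lines = [
    {"span": (1, 5), "label": "line 1"},
    {"span": (6, 8), "label": "line 2"},
]
sentences = [
    {"span": (1, 4), "label": "sentence 1"},
    {"span": (5, 8), "label": "sentence 2"},
]


def overlaps_without_nesting(a, b):
    """True if two inclusive spans intersect but neither contains the other."""
    (a1, a2), (b1, b2) = a, b
    intersect = a1 <= b2 and b1 <= a2
    a_in_b = b1 <= a1 and a2 <= b2
    b_in_a = a1 <= b1 and b2 <= a2
    return intersect and not (a_in_b or b_in_a)


def labels_for(position, *layers):
    """Recombine the layers: list every label that covers one token."""
    return [
        ann["label"]
        for layer in layers
        for ann in layer
        if ann["span"][0] <= position <= ann["span"][1]
    ]


# "line 1" (tokens 1-5) and "sentence 2" (tokens 5-8) interlock, which
# a single XML element hierarchy could not express inline.
print(overlaps_without_nesting((1, 5), (5, 8)))  # True
print(labels_for(5, lines, sentences))           # ['line 1', 'sentence 2']
```

Because neither layer is embedded in the text, either one can be revised or discarded without touching the other.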

The stand-off approach works toward a principle often valued in computer science, that of the disaggregation of data. That is, in a master format, data of a particular type should not be entangled with other types of data. It can later be reaggregated in all kinds of ways, but that is an end product, not the way master data should be stored and managed. It is analogous to the way any well-run kitchen keeps its ingredients separate, until it is time to cook or bake a variety of products. We keep separate our flour, eggs, sugar, and so forth, until we find out what a recipe calls for, at which point we combine those ingredients in a variety of ways. It would be terrible if you were asked to make muesli (or granola), and found that someone had already turned the ingredients you wanted into a cake!

Stand-off annotation is not without problems and vulnerabilities. For example:

When (not if) the base text changes, the editor is unaware of how that change will affect any stand-off annotations.
Not having the annotated text and the annotation in the same reading space can be an inconvenience.
When searching for, or querying, the base text, stand-off annotations can be difficult or impossible to incorporate to refine a selection.
When using the material for other purposes, it can be cumbersome or challenging to reintegrate annotations with the base text.
Linking an annotation to its base text requires extra work and maintenance. Normally this involves building and administering a library of identifiers. Adding and removing ids, or checking them for errors, can be time-consuming and confusing.
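
The first problem in this list is easy to reproduce. In the Python sketch below, a stand-off annotation points at a token by position, and a small revision to the base text silently moves that position onto a different word. The "expected" field and the is_stale check are hypothetical devices added for the illustration; they are not a description of how TAN validation works.

```python
# A stand-off annotation that points at the third token and also
# records what that token was when the annotation was made.
annotation = {"token": 3, "type": "quote", "expected": '"Jump!"'}


def resolve(text, annotation):
    """Return the token the annotation points to, or None if out of range."""
    tokens = text.split()
    index = annotation["token"] - 1
    return tokens[index] if 0 <= index < len(tokens) else None


def is_stale(text, annotation):
    """True if the pointer no longer lands on the token it was made for."""
    return resolve(text, annotation) != annotation["expected"]


original = 'He said "Jump!"'
revised = 'He then said "Jump!"'  # one word inserted upstream

print(resolve(original, annotation))   # "Jump!"
print(resolve(revised, annotation))    # said
print(is_stale(original, annotation))  # False
print(is_stale(revised, annotation))   # True
```

Without a check of this kind, the annotation would still resolve after the revision, but to the wrong word.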

These are important challenges, but TAN validation rules have been designed to mitigate such problems. The last problem listed above is perhaps the greatest barrier to stand-off annotation. TAN approaches pointing in a very different way, one that is closer to current scholarly habits. See the section called “Class 2 pointer syntax: referencing texts”.

Furthermore, TEI inline annotations are supported. In general, you are encouraged to use TEI inline annotations where they are simple and make sense. But when the markup accumulates, threatens to create overlapping structures, or poses other difficulties, TAN class 2 files can be an ideal way to build and curate annotations.