Rationale and Purpose

Scholars working with texts frequently need to study numerous versions. Some texts have been lost in their original form and can be studied only through later translations, paraphrases, or fragmentary quotations. Even when an original survives, its later versions are often worth study, revealing as they do something of how words, concepts, and works were preserved, altered, or combined across the generations and cultures who read and circulated the versions.

Such textual comparison requires words, sentences, paragraphs, and other text segments to be aligned. Such alignment can be challenging. Some versions might be defective, or follow an idiosyncratic sequence. One editor may have divided the text according to a system not easily applied to other versions. Identifying which words or phrases in a translation correspond to which words or phrases in the original might result in complex, overlapping spans. And even larger segments such as sentences and paragraphs may not line up well. Further, every version of a text is part of a much larger, complex history of text reuse, and a complete study of that context requires engagement with other works and other languages, which in turn calls for collaboration across projects and fields of study.

The Text Alignment Network (TAN) XML format facilitates the exchange and scholarly analysis of multiple versions of texts. TAN files adopt a syntax suitable for humans to read and edit, expressive enough to allow scholars to register doubt and nuance, and sufficiently structured to permit complex computer-based queries across independent datasets. The format is actually a suite of formats, built modularly, with each format designed to allow an editor to focus exclusively on a single set of tasks. The format encourages or requires editors to declare their views or assumptions about language and texts in a structured manner, so that other users of the data (both human and computer) can determine whether the data is suitable for their needs. Because nearly all TAN data must be expressed in a way that computers can parse, the information can be used in semantic web applications.

TAN has been designed to support two kinds of scholarly activity: creation and research.

When we create our primary sources or analyses of them, we normally want what we create to be useful to our colleagues. TAN was designed to augment the utility of such creative scholarly activities as:

TAN files that are published and shared produce a decentralized but interoperable corpus of texts. As this TAN-compliant corpus expands across linguistic, chronological, and spatial boundaries, third-party tools and applications can expand the repertoire of research questions beyond any single corpus, to help scholars fruitfully investigate broader, comparative questions such as:

This is not to say that the TAN format, in itself, answers such questions. It merely lays a framework within which such questions can be investigated. Some other caveats: