Rationale and purpose

Different versions of texts (translations, quotations, paraphrases, and so forth) are important sources for scholars. Some texts have been lost in their original form and can be studied only through later translations, paraphrases, or fragmentary quotations. Even when an original survives, its later versions are often worth study, because they reveal something of the genius or idiosyncrasies of those who translated or quoted the original. That in turn sheds light on how words, concepts, and works were preserved, altered, or combined across the generations and cultures that read and circulated the versions.

The comparison of versions of texts requires words, sentences, paragraphs, and other text segments to be aligned. Such alignment can be challenging. Some versions might be defective, or follow an idiosyncratic sequence. One editor may have chosen a segmentation system not easily applied to other versions. Identifying which words or phrases in a translation correspond to which words or phrases in the original might result in complex, overlapping sets. And even larger segments such as sentences and paragraphs may not line up well. Further, every version of a text is part of a much larger, complex history of text reuse, and a proper study of that context requires not only multiple versions of different works, but collaboration across projects and fields of study.

The Text Alignment Network (TAN) XML format facilitates the exchange and scholarly analysis of multiple versions of texts. TAN files adopt a syntax suitable for humans to read and edit, expressive enough to allow scholars to register doubt and nuance, and sufficiently structured to permit complex computer-based queries across independent datasets. The format is modular, with each module designed to allow an editor to focus on a single set of tasks (editing, grammatical analysis, word alignment, etc.) without having to worry about tasks handled by other modules. The format compels editors to declare their views or assumptions about language and texts in a structured manner, so that other users of the data (both human and computer) can determine whether the data is suitable for their needs. Because nearly all TAN data must be expressed in a way that computers can parse, the information can be used in semantic web applications.

TAN has been designed to support specific research desiderata such as the following:

The last question is especially significant. As TAN files get published, there emerges a kind of Internet of primary sources: a decentralized corpus of texts that "talk" to each other. As this TAN-compliant corpus expands across linguistic, chronological, and spatial boundaries, the interoperability of its parts allows third-party tools and applications to be developed. These expand the repertoire of research questions beyond any single corpus and help scholars fruitfully investigate broader, comparative questions such as:

These ambitious research questions should be tempered with some caveats: