Different versions of texts—translations, quotations, paraphrases, and so forth—are important sources for scholars. Some texts have been lost in their original form and can be studied only through later translations, paraphrases, or fragmentary quotations. Even when an original survives, its later versions are often worth study, revealing as they do something of the genius or idiosyncrasies of those who translated or quoted the original, which in turn sheds light on how words, concepts, and works were preserved, altered, or combined across the generations and cultures who read and circulated the versions.
The comparison of versions of texts requires words, sentences, paragraphs, and other text segments to be aligned. Such alignment can be challenging. Some versions might be defective, or follow an idiosyncratic sequence. One editor may have chosen a segmentation system not easily applied to other versions. Identifying which words or phrases in a translation correspond to which words or phrases in the original might result in complex, overlapping spans. And even larger segments such as sentences and paragraphs may not line up well. Further, every version of a text is part of a much larger, complex history of text reuse, and a proper study of that context requires not only multiple versions of different works, but collaboration across projects and fields of study.
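The simplest form of the alignment problem described above, matching segments across versions that may be defective or out of sequence, can be sketched in a few lines of Python. This is only an illustration and not part of TAN: the function name `align_by_reference`, the version labels, and the placeholder segment texts are all hypothetical, and the sketch assumes that every version uses the same dotted numeric reference labels (such as `1.2`), which real editions often do not.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model: a version maps a reference label (e.g. "1.2") to its text.
Version = Dict[str, str]


def ref_key(ref: str) -> Tuple[int, ...]:
    """Sort key for dotted numeric labels, so that "1.10" follows "1.9"."""
    return tuple(int(part) for part in ref.split("."))


def align_by_reference(
    versions: Dict[str, Version],
) -> List[Tuple[str, Dict[str, Optional[str]]]]:
    """Align segments across versions by shared reference label.

    A version that lacks a segment gets None for that reference, so a
    defective version leaves a visible gap instead of shifting later rows.
    """
    all_refs = {ref for version in versions.values() for ref in version}
    return [
        (ref, {name: version.get(ref) for name, version in versions.items()})
        for ref in sorted(all_refs, key=ref_key)
    ]


# Placeholder data: the second version is out of order and lacks segment 1.2.
versions = {
    "original": {"1.1": "alpha", "1.2": "beta", "1.3": "gamma"},
    "translation": {"1.3": "gamma (tr.)", "1.1": "alpha (tr.)"},
}

aligned = align_by_reference(versions)
for ref, row in aligned:
    print(ref, row)
```

Even this toy case shows why a shared reference system matters: once one editor segments the text differently, or a word-level correspondence overlaps several segments, a lookup keyed on a single label no longer suffices.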
The Text Alignment Network (TAN) XML format facilitates the exchange and scholarly analysis of multiple versions of texts. TAN files adopt a syntax suitable for humans to read and edit, expressive enough to allow scholars to register doubt and nuance, and sufficiently structured to permit complex computer-based queries across independent datasets. The format is modular, with each module designed to allow an editor to focus on a single set of tasks without having to worry about other related but separable ones. The format encourages or requires editors to declare their views or assumptions about language and texts in a structured manner, so that other users of the data (both human and computer) can determine whether the data is suitable for their needs. Because nearly all TAN data must be expressed in a way that computers can parse, the information can be used in semantic web applications.
TAN has been designed to support specific research desiderata such as the following:
I want to share the transcription of a particular version of a textual work. How do I encode it such that it is most likely to align with any other version of that text created by someone else?
I have an index of quotations I wish to make available. How do I encode it such that the data is semantically rich and can be applied to other, perhaps unknown versions of the same work?
How do I align multiple versions of a single work when those versions may not match very well, or when the reason for alignment may be vague or ambiguous?
How do I publish a word-for-word analysis of a source and its translation, when there may be messy overlapping or ambiguous relationships, and where I might need to express doubt or alternative possibilities of alignment?
How do I publish a dataset that lists passages in two or more works that share a common feature, such as verbatim text or a parallel topic?
How can I share my data with others, and notify or warn them when I make corrections or changes to the master version?
The last question is especially significant. As TAN files are published, there emerges a web of primary sources—a decentralized corpus of texts that "talk" to each other. As this TAN-compliant corpus expands across linguistic, chronological, and spatial boundaries, the interoperability of its parts allows the development of third-party tools and applications to expand the repertoire of research questions beyond any single corpus, to help scholars fruitfully investigate broader, comparative questions such as:
For classical Greek texts, how were words with the root -ιστημι ("stand") translated into ancient Latin? In what specific ways did the vocabulary of technical terms shift from pre-Christian translations into later, Christian ones?
How does the reformed Chinese technique for translating Sanskrit Buddhist texts, attested by Dao An (312-385 CE), compare to the seventh- and eighth-century reforms of Syriac translations of Greek texts?
How do Abbasid-period Arabic translations of Greek texts differ from those of Sanskrit texts?
Can an anonymous English translation of a modern French novel be identified with known translators of French novels from the same period?
How do present-day translations of official United Nations documents differ across languages?
Optimism that TAN could be used to address such research questions should be tempered:
Although TAN comes with an extensive library of functions and templates, it is not a tool per se. It does not provide software or applications to create, edit, or display TAN-compliant files, nor does it dictate the behavior of such tools.
TAN does not on its own create alignments or answer research questions. It merely lays a framework within which such questions can be investigated.
TAN has a restricted field of inquiry (defined and explained in these guidelines). The format is not suitable for many lines of inquiry, e.g., reconstructing the format of an original book or article.
TAN is just one of many formats for texts. It supplements, and does not replace, other common markup formats such as TEI, DocBook, and so forth, or other alignment formats such as XLIFF or TMX. Conversion between TAN and these formats is usually straightforward, but may not be lossless, and should be planned with care.
TAN has not been designed to prioritize computational efficiency. It sacrifices repetition and explicitness in favor of terseness and human readability. The extensive TAN validation routines, essential to interoperability, can be taxing to run on numerous or very large files. This choice rests on the principle that users of the format prioritize quality and readability over speed.