Scholars working with texts frequently need to study numerous versions. Some texts have been lost in their original form and can be studied only through later translations, paraphrases, or fragmentary quotations. Even when an original survives, its later versions are often worth study, revealing as they do something of how words, concepts, and works were preserved, altered, or combined across the generations and cultures who read and circulated the versions.
Such textual comparison requires words, sentences, paragraphs, and other text segments to be aligned. Such alignment can be challenging. Some versions might be defective, or follow an idiosyncratic sequence. One editor may have divided the text according to a system not easily applied to other versions. Identifying which words or phrases in a translation correspond to which words or phrases in the original might result in complex, overlapping spans. And even larger segments such as sentences and paragraphs may not line up well. Further, every version of a text is part of a much larger, complex history of text reuse, and a complete study of that context requires engagement with other works and other languages, and therefore collaboration across projects and fields of study.
The Text Alignment Network (TAN) XML format facilitates the exchange and scholarly analysis of multiple versions of texts. TAN files adopt a syntax suitable for humans to read and edit, expressive enough to allow scholars to register doubt and nuance, and sufficiently structured to permit complex computer-based queries across independent datasets. The format is actually a suite of formats, built modularly, with each format designed to allow an editor to focus exclusively on a single set of tasks. The format encourages or requires editors to declare their views or assumptions about language and texts in a structured manner, so that other users of the data (both human and computer) can determine whether the data is suitable for their needs. Because nearly all TAN data must be expressed in a way that computers can parse, the information can be used in semantic web applications.
TAN has been designed to support two kinds of scholarly activity: creation and research.
When we create our primary sources or analyses of them, we normally want what we create to be useful to our colleagues. TAN was designed to augment the utility of such creative scholarly activities as:
Creating and sharing a transcription of a particular version of a textual work such that it is most likely to align with any other TAN version of that text created by someone else;
Creating an index of quotations that is semantically rich and can be applied to any other version of the quoting or quoted works;
Specifying exactly (e.g., word-for-word) where a source and its translation correspond, even when there may be messy overlapping or ambiguous relationships, or where doubt or alternative possibilities of alignment need to be expressed;
Listing the lexicomorphological features of each word in a text or a language such that the linguistic data has meaning above and beyond a particular coding scheme, and can be collated with lexicomorphological data for other languages.
TAN files that are published and shared produce a decentralized but interoperable corpus of texts. As this TAN-compliant corpus expands across linguistic, chronological, and spatial boundaries, third-party tools and applications can expand the repertoire of research questions beyond any single corpus, to help scholars fruitfully investigate broader, comparative questions such as:
For classical Greek texts, how were words with the root -ιστημι ("stand") translated into ancient Latin? In what specific ways did the vocabulary of technical terms shift from pre-Christian translations into later, Christian ones?
How does the reformed Chinese translation technique for Sanskrit Buddhist texts, attested by Dao An (312-385 CE), compare to the reforms in the seventh and eighth centuries of Syriac translations of Greek texts?
How do Arabic translations of Greek texts from the Abbasid period differ from contemporaneous translations from Sanskrit into Arabic?
Can an anonymous English translation of a modern French novel be identified with known translators of French novels from the same period?
How do present-day translations of official United Nations documents differ across languages?
This is not to say that the TAN format, in itself, answers such questions. It merely lays a framework within which such questions can be investigated. Some other caveats:
Although TAN comes with an extensive library of functions and templates, it is not a tool per se. It does not provide software or applications to create, edit, or display TAN-compliant files, nor does it dictate how such tools should behave. Rather, it allows you or a developer (especially an XML developer) to create customized applications and tools.
The TAN formats are specialized. They supplement, and do not replace, other common text formats such as TEI, Docbook, and so forth, or other alignment formats such as XLIFF or TMX. Converting from TAN into these formats is usually straightforward, but will normally entail loss. On the other hand, converting from one of these formats into TAN normally cannot be completely automated, because the TAN format has scholarly expectations that are not required in the other formats. Conversion must be given careful thought.
TAN has a restricted field of inquiry (defined and explained in these guidelines). The format is not suitable for many lines of inquiry, e.g., representing how a text was displayed in a particular edition.
TAN has been designed to serve those who prioritize legibility and readability over computational efficiency. The extensive TAN validation routines—essential to aiding interoperability—can be taxing to run on numerous or enormous files.