Scholars frequently work with numerous versions of texts. Sometimes the original version has been lost, or survives only fragmentarily, and can be studied only through later translations, paraphrases, or quotations. Even when an original survives, its later versions are often worth study, revealing as they do something of how words, concepts, and works were preserved, altered, or combined by generations and cultures who created, read, and circulated the versions.
Such textual comparison requires texts whose words, sentences, paragraphs, and other segments are aligned. Such alignment can be challenging. Some versions might be defective, or follow an idiosyncratic sequence. One editor may have divided the text according to a system not easily applied to other versions. Identifying which words or phrases in a translation and its original correspond might result in complex, overlapping spans. And even larger segments such as sentences and paragraphs may not line up well. Further, every version of a text is part of a much larger, complex history of text reuse, and a complete study of that context requires engagement with other works and other languages, and collaboration across projects and fields of study.
Text Alignment Network (TAN) XML facilitates the exchange of multiple versions of texts and annotations on those texts. TAN syntax is suitable for humans to read and edit, expressive enough to allow scholars to register doubt and nuance, and sufficiently structured to permit complex computer-based queries across independent datasets. TAN is not a single format, but rather a suite of formats, one task per format. Because nearly all TAN data must be expressed in a way that computers can parse, the information can be used in semantic web applications (see the section called “Resource Description Framework (RDF) and Linked Open Data”).
TAN has been designed to support two kinds of scholarly activity: creation and research.
When we create our primary sources or analyze them, we normally want what we create to be useful to our colleagues. TAN was designed to assist scholarly creative activities such as:
Creating and sharing a transcription of a particular version of a textual work so that it is more likely to align with any other TAN version of that text created by someone else;
Creating an index of quotations that is semantically rich and can be applied to any other version of the quoting or quoted works;
Specifying exactly (e.g., word-for-word) where a source and its translation correspond, even with overlapping or ambiguous relationships, or where doubt or alternative possibilities of alignment need to be expressed;
Listing the grammatical features of every word in a text or a language in a way that allows them to be compared easily against those of other languages and texts.
Shared TAN files form a decentralized, interoperable corpus of texts, a kind of Internet of primary sources and annotations. As this TAN-compliant corpus spreads into different linguistic, chronological, and geographical regions, third-party tools and applications can expand the repertoire of research questions beyond any single corpus, to help scholars fruitfully investigate broader, comparative questions such as:
For classical Greek texts, how were words with the root -ιστημι ("stand") translated into ancient Latin? In what specific ways did the vocabulary of technical terms shift from pre-Christian translations into later, Christian ones?
How does the reformed Chinese translation technique for Sanskrit Buddhist texts, attested by Dao An (312-385 CE), compare to reforms in the seventh and eighth centuries of Syriac translations of Greek texts?
How do Arabic translations of Greek texts from the Abbasid period differ from contemporaneous translations from Sanskrit into Arabic?
Can an anonymous English translation of a modern French novel be identified with known translators from that period?
How do present-day translations of official United Nations documents differ across languages?
Neither the TAN format nor its applications answer such questions. But they can be used to begin working toward answers, because the TAN function library includes many cutting-edge algorithms that cannot be found in other programming libraries, whether XSLT or not. What the Natural Language Toolkit (or the related Classical Language Toolkit) is for digital humanists using Python, TAN aspires to be for those using XSLT. For more on the function library see the section called “Using TAN functions”.