Scholars working with texts frequently need to work with numerous versions. Some texts have been lost in their original form and can be studied only through later translations, paraphrases, or fragmentary quotations. Even when an original survives, its later versions are often worth study, revealing as they do something of how words, concepts, and works were preserved, altered, or combined by generations and cultures who created, read, and circulated the versions.
Such textual comparison requires texts whose words, sentences, paragraphs, and other segments are aligned. Such alignment can be challenging. Some versions might be defective, or follow an idiosyncratic sequence. One editor may have divided the text according to a system not easily applied to other versions. Identifying which words or phrases in a translation and its original correspond to each other might result in complex, overlapping spans. And even larger segments such as sentences and paragraphs may not line up well. Further, every version of a text is part of a much larger, complex history of text reuse, and a complete study of that context requires engagement with other works and other languages, and collaboration across projects and fields of study.
Text Alignment Network (TAN) XML facilitates the exchange of multiple versions of texts and annotations on those texts. TAN syntax is suitable for humans to read and edit, expressive enough to allow scholars to register doubt and nuance, and sufficiently structured to permit complex computer-based queries across independent datasets. TAN is not a single format, but rather a suite of formats, built modularly. Each format is dedicated to a particular task, requiring editors to declare their views or assumptions about language and texts in a structured manner, so that other users of the data (whether human or computer) can decide whether the data meets their needs. Because nearly all TAN data must be expressed in a way that computers can parse, the information can be used in semantic web applications (see the section called “Resource Description Framework (RDF) and Linked Open Data”).
TAN has been designed to support two kinds of scholarly activity: creation and research.
When we create our primary sources or analyze them, we normally want what we create to be useful to our colleagues. TAN was designed to assist scholarly creative activities such as:
Creating and sharing a transcription of a particular version of a textual work so that it is more likely to align with any other TAN version of that text created by someone else;
Creating an index of quotations that is semantically rich and can be applied to any other version of the quoting or quoted works;
Specifying exactly (e.g., word-for-word) where a source and its translation correspond, even with overlapping or ambiguous relationships, or where doubt or alternative possibilities of alignment need to be expressed;
Listing the grammatical features of every word in a text or a language in a way that allows it to be compared easily against other languages and texts.
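To make the idea of word-for-word correspondence with overlap and doubt more concrete, here is a minimal sketch in Python. It is not TAN syntax and does not use the TAN function library; the `Alignment` class, its field names, the `certainty` scale, and the helper functions are all hypothetical, chosen only to illustrate the kind of data an alignment has to carry.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record linking source tokens to translation tokens."""
    source: tuple            # 0-based token positions in the source version
    target: tuple            # 0-based token positions in the translation
    certainty: float = 1.0   # assumed scale: 1.0 = no doubt, lower = more doubt


def overlapping_pairs(alignments):
    """Return pairs of alignments that share a source or a target token."""
    return [
        (a, b)
        for a, b in combinations(alignments, 2)
        if set(a.source) & set(b.source) or set(a.target) & set(b.target)
    ]


def render(alignment, source_tokens, target_tokens):
    """Format one alignment as readable text."""
    src = " ".join(source_tokens[i] for i in alignment.source)
    tgt = " ".join(target_tokens[i] for i in alignment.target)
    return f"{src} = {tgt} (certainty {alignment.certainty})"


# Opening words of John 1:1 in Greek and in the Latin Vulgate.
source_tokens = ["ἐν", "ἀρχῇ", "ἦν", "ὁ", "λόγος"]
target_tokens = ["in", "principio", "erat", "verbum"]

alignments = [
    Alignment((0,), (0,)),
    Alignment((1,), (1,)),
    Alignment((2,), (2,)),
    # Article and noun together correspond to one Latin word.
    Alignment((3, 4), (3,)),
    # Alternative reading, recorded with doubt: only the noun corresponds.
    Alignment((4,), (3,), certainty=0.5),
]

for alignment in alignments:
    print(render(alignment, source_tokens, target_tokens))
```

The last two records deliberately overlap: both claim the Latin word, and both include the Greek noun. A structure that allowed each token to belong to only one alignment could not record the alternative, which is why overlap and a separate measure of doubt are kept as first-class data in this sketch.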
Shared TAN files form a decentralized, interoperable corpus of texts, a kind of Internet of primary sources and annotations. As this TAN-compliant corpus spreads into different linguistic, chronological, and geographical regions, third-party tools and applications can expand the repertoire of research questions beyond any single corpus, to help scholars fruitfully investigate broader, comparative questions such as:
For classical Greek texts, how were words with the root -ιστημι ("stand") translated into ancient Latin? In what specific ways did the vocabulary of technical terms shift from pre-Christian translations into later, Christian ones?
How does the reformed Chinese translation technique of Sanskrit Buddhist texts, attested by Dao An (312-385 CE), compare to the seventh- and eighth-century reforms of Syriac translations of Greek texts?
How do Arabic translations of Greek texts from the Abbasid period differ from contemporaneous translations from Sanskrit into Arabic?
Can an anonymous English translation of a modern French novel be identified with known translators from that period?
How do present-day translations of official United Nations documents differ across languages?
Neither the TAN format nor its applications answer such questions, but they can be used to start answering them.
TAN differs from other text formats such as HTML, Microsoft Word, PDF, or Docbook. Each of those formats is interoperable only in the sense that any file can be reliably opened and displayed by the same software. Despite such software compatibility, the content, structured by each user, looks very different from one file to the next. If you received from different people two versions of a particular literary work in the same format, there would be little likelihood that you could align them without a lot of extra work. These are presentation formats, designed to let the creator use his or her imagination to shape, structure, and present the material in highly stylized, creative ways. The formats are laissez-faire, concerned mainly to ensure that each component is rendered properly, without regard for the meaning of those components.
Creating a text in TAN is like opening a word processor and telling it, "I don't care how the text looks. I want to ensure that it is in a meaningful structure that corresponds to any other version of that text. The appearance, which could take thousands of directions, can be worried about later." The closest analogue to the TAN formats is the XML format developed by the Text Encoding Initiative, whose design catalyzed and continues to inspire the development of TAN. TAN adopts and extends the TEI validation rules, to make them more rigorous and penetrating, to support cross-project interoperability. One of the TAN formats is modestly customized TEI. (For more on comparisons between TAN and TEI see the section called “The Text Encoding Initiative”.)
Some other caveats:
Although TAN comes with an extensive library of functions and templates, it is not what most people think of as a tool or application. It does not provide a graphic interface to create, edit, or display TAN-compliant files, nor does it dictate how such tools should behave. Rather, it allows programmers (especially XML developers) to create customized applications and tools. If you are working with an XML editor like oXygen, your editing experience will be greatly enhanced by the TAN function library.
The TAN formats are specialized. They are not meant to replace other common text formats such as TEI, Docbook, and so forth, or other alignment formats such as XLIFF or TMX. Converting a TAN file into these formats is usually straightforward, but will usually entail loss. Conversely, most conversions from one of these formats into TAN will not entail loss, but will be imperfect or incomplete, because the TAN format requires data that will be missing, or not easily identifiable. Conversion must be given careful thought, and can only be semiautomated.
Each TAN format has a restricted field of inquiry, defined and explained in these guidelines. TAN is not suitable for unsupported research interests, e.g., marking a transcription to imitate its presentation in a particular print edition.
TAN files are optimized for legibility and readability, and may be inefficient in certain contexts and applications. The extensive TAN validation routines—essential to aiding interoperability—can be taxing to run on numerous or large files. There are work-arounds, explained in the guidelines. Many applications will perform better when TAN files are pre-processed. See the section called “Using TAN outside the Network”.