Text Alignment Network

The Text Alignment Network (TAN) is a suite of XML encoding formats and associated rules and recommended practices, intended to serve anyone who wishes to encode, exchange, and study translations, paraphrases, adaptations, quotations, and other varieties of text reuse.

The XML encoding formats behind TAN have been designed to be maximally readable and editable by both humans and machines, to be useful for any program or tool that wishes to support the TAN format, and to be both syntactically and semantically interoperable. The TAN format is suitable not only for export/import but, in many cases, for native use.

Built upon stand-off annotation, the TAN format allows scholars to edit and study the same texts independently and collaboratively. TAN schemas are written not only to ensure validation, but to provide feedback and assistance to anyone editing and correcting TAN files. Because the format has well-defined rules, it comes with a library of functions that definitively interpret the format and pre-process the files, thereby assisting developers creating TAN-based tools and applications.
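The general idea of stand-off annotation can be sketched in a few lines of Python. This is only an illustration of the principle, not the TAN format itself: the segment labels, the `Annotation` class, and the `annotations_for` helper are all hypothetical names invented for this sketch, and real TAN files are XML governed by the project's schemas.

```python
from dataclasses import dataclass

# Hypothetical base text, divided into labeled segments.
# The labels ("1", "2") are illustrative only, not TAN reference syntax.
BASE_TEXT = {
    "1": "The quick brown fox",
    "2": "jumps over the lazy dog",
}


@dataclass(frozen=True)
class Annotation:
    ref: str     # label of the segment being annotated
    author: str  # who made the annotation
    note: str    # the annotation itself


def annotations_for(ref, *layers):
    """Collect the annotations on one segment from any number of layers."""
    return [a for layer in layers for a in layer if a.ref == ref]


# Two editors annotate the same text independently. Neither layer
# modifies the base text or depends on the other layer.
layer_a = [Annotation("1", "editor-a", "Opening phrase of a pangram")]
layer_b = [
    Annotation("1", "editor-b", "Adjective order: opinion, then color"),
    Annotation("2", "editor-b", "Main verb begins here"),
]

# Every annotation must point at a segment that exists in the base text.
all_refs_resolve = all(a.ref in BASE_TEXT for a in layer_a + layer_b)

merged = annotations_for("1", layer_a, layer_b)
```

Because each layer refers to the text only by segment label, the layers can be written, exchanged, and combined separately, which is the property that lets several scholars work on the same text both independently and collaboratively.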

TAN is meant primarily to support research in linguistics, translation studies, and the humanities, that is, any scholarly field concerned with the interpretation of textual reuse or versions of texts. The format is designed to handle any text written in a Unicode-supported writing system, from any period, antiquity to the recent past.

Although expressive of scholarly nuance and complexity, TAN files are meant to benefit everyone and to be used in broad applications such as multilingual publishing, language learning, and machine translation.


The rationale for TAN can be summarized:

  1. There are many methods for aligning texts, but none is standard and few are shared.
  2. Those methods that are shared do not allow interoperability. That is, if two versions of a single work are encoded independently using the same method, they are unlikely to be interchangeable without some human intervention. And that intervention must be repeated for every new pair of texts to be aligned.
  3. Existing methods enforce or allow varying degrees of syntactic interoperability (markup syntax and structure) but little, if any, semantic interoperability (markup meaning).

These claims are justified by the list below, which provides a diagnostic sample of ways to align two or more texts, taken from projects and tools in digital humanities and computational linguistics. This spreadsheet is public, and comments are welcome.

Is it not time for textual scholars to use a format that maximizes the chance that our texts will be semantically and syntactically interoperable?


Because TAN is under active development, its schemas (RELAX NG, Schematron), documentation, and examples are unstable and not yet ready for widespread public release. For those with a background in XML technology, materials under development are available on this website and at GitHub.

As TAN is tested and revised, demonstration outputs will be released on occasion, to show practical uses of the format: http://textalign.net/output. Some important caveats:


We welcome active participants in testing, using, and developing the Text Alignment Network. Our purpose is to develop and maintain a suite of validation files; guidelines documenting the rationale and best use of those validation files; and select examples, tutorials, and tools to assist TAN editors and users. Inquiries about participation should be sent to the project manager, Joel Kalvesmaki, by email: kalvesmaki at gmail.com.

Official announcements are made by email (Google Group) and by Twitter.

