Token-Based Annotations and Alignments (<TAN-A-tok>)

TAN-A-tok files provide a microscopic view of how two sources relate to each other. The format is intended to allow you to specify exactly where, how, and why two transcriptions align, and to do so on the most granular level possible. TAN-A-tok files also allow you to express levels of confidence or alternative opinions.

Creators and editors of TAN-A-tok files should be able to read the languages of their sources and to explain as precisely as possible the relationship between the two sources. They should be prepared to think about and specify types of textual reuse. TAN-A-tok files tend to be more demanding to create and edit than TAN-A-div files are because they reflect work that is more detailed, and therefore more time-consuming, than simple en masse alignment of sources.

Because of the detailed nature of the inquiry, token alignment is restricted to two texts, referred to jointly as a bitext. Each half of the bitext must be a TAN-T(EI) file. It is assumed that those two sources share some special relationship, direct or indirect, and relate through one or more types of textual reuse: translation, paraphrase, commentary, and so forth. Some of these bitexts, such as literal translations, may line up quite nicely word for word. Others, such as paraphrases, may line up sporadically, vaguely, ambiguously, or, in places, not at all. Annotating a bitext is therefore often difficult, and it requires you to think hard about assumptions you have made in two key areas: the relationship that holds between the two scripta and the types of reuse that were involved in turning one version into the other (or a common ancestor into both).

Relationship of sources' scripta. What is the physical relationship or history that connects the two sources' scripta? Is one a direct descendant (copy) of the other? If not, what common ancestor do they share? Here you consider the material aspect of the bitext, because you are trying to answer how object A's text relates to object B's.

Types of reuse. What categories of text reuse do you consider operative? Such a declaration tells users of your data what paradigm you bring to your analysis. You may wish to keep your categories loosely defined, using general concepts such as translation, paraphrase, or quotation. On the other hand, you may subscribe to a detailed view of text reuse, perhaps adopting field-specific categories such as obligatory explicitation, optional explicitation, pragmatic explicitation, or translation-inherent explicitation. You may also wish to declare secondary types of reuse that may have intervened, such as scribal omission or dittography. You must declare at least one type of reuse, whether one you define yourself or one built into the TAN format. See the section called “TAN keywords for types of bitext reuse (<reuse-type>)”.

The root element of a token-based alignment file is <TAN-A-tok>.

The TAN-A-tok header builds upon the core and class 2 headers (see the section called “Metadata (<head>)” and the section called “Class 2 Metadata (<head>)”).

TAN-A-tok files take exactly two <source>s. The sequence is arbitrary. Each <source> must take an @xml:id.

<definitions> takes, in addition to all the elements allowed in class 2 files (see the section called “Class 2 Metadata (<head>)”), two elements unique to TAN-A-tok: <bitext-relation> and <reuse-type>. The former describes the genealogical relationship between the two sources' scripta. The latter attends to the qualitative aspect of the bitext relationship.
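
As a rough illustration of the header parts described so far, the fragment below sketches two sources and the two definitions unique to TAN-A-tok. The id values (grc, eng, descends, trans) are hypothetical. The contents of each element are elided as comments, because they are governed by the metadata sections cited above and not by this one, and the ordering of children within <head> is an assumption.

```xml
<head>
   <!-- core and class 2 metadata elided -->
   <source xml:id="grc">
      <!-- identification of the first TAN-T(EI) file -->
   </source>
   <source xml:id="eng">
      <!-- identification of the second TAN-T(EI) file -->
   </source>
   <definitions>
      <bitext-relation xml:id="descends">
         <!-- how the two sources' scripta are related -->
      </bitext-relation>
      <reuse-type xml:id="trans">
         <!-- the kind of reuse at work, e.g. translation -->
      </reuse-type>
   </definitions>
</head>
```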

The <body> of a TAN-A-tok file takes, in addition to the customary optional attributes (see @in-progress and the section called “Edit Stamp”), required @bitext-relation and @reuse-type, which take one or more id references from <bitext-relation> and <reuse-type>, indicating the default values that govern the alignment.

<body> has only one type of child: one or more <align>s, each of which collects sets of <tok>s from one or both sources, known collectively as a token cluster. Clusters may overlap, to handle translations in which words fall in one-to-one, one-to-many, many-to-one, and many-to-many relationships. The independence of token clusters allows you to register differences of opinion about the same set of tokens. An <align> may take an @xml:id, to facilitate external discussions about an assertion.
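
A minimal sketch of a <body> with two overlapping token clusters might look like the following. The id values are hypothetical, and the <tok> attributes @src, @ref, and @pos (source id, reference, and token position) are assumptions here, since this section does not itself define how a <tok> picks out a token.

```xml
<body bitext-relation="descends" reuse-type="trans">
   <!-- one-to-one: a single token from each source -->
   <align xml:id="a1">
      <tok src="grc" ref="1" pos="1"/>
      <tok src="eng" ref="1" pos="2"/>
   </align>
   <!-- one-to-many, overlapping a1: the same Greek token
        set against two English tokens -->
   <align xml:id="a2">
      <tok src="grc" ref="1" pos="1"/>
      <tok src="eng" ref="1" pos="1"/>
      <tok src="eng" ref="1" pos="2"/>
   </align>
</body>
```

Because the two clusters are independent, a1 and a2 can stand side by side as rival or complementary claims about the same Greek token.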

Nothing should be inferred from silence in a TAN-A-tok file. Unmentioned tokens in either source do not represent gaps in a translation. All that can be inferred is that the creators and editors of the TAN-A-tok file have said nothing about the tokens.

If you wish to declare that one or more words in one source were left out of a translation or inserted into one—that is, words in one source have no match in the other—you must do so through a half-null alignment, i.e., a token cluster that has tokens from only one source. A half-null alignment implies insertions or omissions.
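
A half-null alignment could be sketched as follows, again assuming hypothetical source ids and the @src, @ref, and @pos attributes on <tok>, which are not defined in this section. Every <tok> in the cluster points to the same source.

```xml
<!-- half-null alignment: tokens from only one source,
     asserting that they have no match in the other -->
<align>
   <tok src="eng" ref="1" pos="4"/>
   <tok src="eng" ref="1" pos="5"/>
</align>
```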

A fully aligned bitext may result in a TAN-A-tok file with a very long <body> (in contrast to the typical TAN-A-div file). That does not mean, however, that everything in a source must be encoded or described. In writing and editing a TAN-A-tok file you do not commit yourself to saying everything possible about the bitext. You might choose to encode only a few token clusters.

If there are multiple IDs in @reuse-type or @bitext-relation, the intersection, not the union, of those values is to be understood. For example, reuse-type="trans para" would indicate that the token cluster results from a combination of translation and paraphrase. If you wish to claim that the token cluster might be a translation or it might be a paraphrase, then you should create two separate <align>s, and add @cert.
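
The difference between a joint claim and alternative claims can be sketched as below. The fragment assumes that an <align> may carry its own @reuse-type to override the <body> default, that @cert takes a decimal between 0 and 1, and that <tok> uses @src, @ref, and @pos. The ids trans and para, like the certainty values, are purely illustrative.

```xml
<!-- joint claim: the cluster is both translation and paraphrase -->
<align reuse-type="trans para">
   <tok src="grc" ref="2" pos="3"/>
   <tok src="eng" ref="2" pos="5"/>
</align>

<!-- alternative claims about the same tokens, each with its own certainty -->
<align reuse-type="trans" cert="0.6">
   <tok src="grc" ref="2" pos="3"/>
   <tok src="eng" ref="2" pos="5"/>
</align>
<align reuse-type="para" cert="0.4">
   <tok src="grc" ref="2" pos="3"/>
   <tok src="eng" ref="2" pos="5"/>
</align>
```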