Token-Based Alignments (<TAN-A-tok>)

TAN-A-tok files provide a microscopic view of how two sources relate to each other. The format is intended to allow you to specify exactly where, how, and why two transcriptions align, and to do so on the most granular level possible. TAN-A-tok files also allow you to express levels of confidence or alternative opinions.

Creators and editors of TAN-A-tok files should be able to read the languages of their sources and to explain as precisely as possible the relationship between the two sources. You should be prepared to think about and specify types of textual reuse. TAN-A-tok files tend to be more demanding to create and edit than TAN-A-div files are because they reflect work that is more detailed, and therefore more time-consuming, than simple en masse alignment of sources.

Because of the detailed nature of the inquiry, token alignment is restricted to two texts, referred to jointly as a bitext. Each half of the bitext must be a TAN-T(EI) file. It is assumed that those two sources share some special relationship, direct or indirect, and relate through one or more types of textual reuse: translation, paraphrase, commentary, and so forth. Some of these bitexts, such as literal translations, may line up quite nicely word for word. Others, such as paraphrases, may line up sporadically, vaguely, ambiguously, or, in places, not at all. Aligning a bitext is therefore often difficult, and it requires you to think hard about the assumptions you have made in two key areas: the relationship that holds between one source's scriptum and the other's, and the types of reuse that were involved in turning one version into the other (or a common ancestor into both).

Relationship of sources' scripta. What is the physical relationship or history that connects the two sources' scripta? Is one a direct descendant (copy) of the other? If not, where is their common ancestor? Here you consider the material aspect of the bitext: you are trying to answer how object A's text relates to object B's, which goes a long way toward explaining the relationship that holds between the immaterial texts.

Types of reuse. What categories of text reuse do you hold to? Such a declaration tells users of your data what paradigm you bring to your analysis. You may wish to keep your categories nondescript and somewhat vague, using loosely defined concepts such as translation, paraphrase, quotation, and so forth without offering a specific definition. On the other hand, you may have a specific and detailed view of text reuse. Perhaps you have adopted field-specific categories such as obligatory explicitation, optional explicitation, pragmatic explicitation, or translation-inherent explicitation. You may also wish to declare secondary types of reuse that may have intervened, such as scribal omission or dittography. You must declare at least one type of reuse, either one you define yourself or one of those built into the TAN format. See the section called “TAN keywords for types of bitext reuse (<reuse-type>)”.

The root element of a token-based alignment file is <TAN-A-tok>.

The TAN-A-tok header builds upon the core and class 2 headers (see the section called “Metadata (<head>)” and the section called “Class 2 Metadata (<head>)”).

TAN-A-tok files take exactly two <source>s. The sequence is arbitrary. Each <source> must take an @xml:id.

<declarations> takes, in addition to all the elements allowed in class 2 files (see the section called “Class 2 Metadata (<head>)”), two elements unique to TAN-A-tok: <bitext-relation> and <reuse-type>. The former describes the genealogical relationship between each source's scriptum. The latter attends to the qualitative aspect of the bitext relationship.

The <body> of a TAN-A-tok file takes, in addition to the customary optional attributes (see @in-progress and the section called “Edit Stamp”), required @bitext-relation and @reuse-type, which take one or more id references from <bitext-relation> and <reuse-type>, indicating the default values that govern the alignment.
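
The structural rules described so far can be illustrated with a minimal sketch in Python. It is not a substitute for the TAN schemas. The element names and the attributes @xml:id, @bitext-relation, and @reuse-type come from the text above; everything else is an assumption made for illustration: the id values (grc, eng, unclear, trans, para), the placement of <source> and <declarations> directly inside <head>, the attributes src, ref, and pos on <tok>, and the omission of the TAN namespace and of all other header content.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined, so ElementTree expands xml:id to this name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, heavily simplified skeleton (no namespace, no other metadata).
# The <tok> attributes src/ref/pos are assumptions, not taken from the text.
SKELETON = """
<TAN-A-tok>
  <head>
    <source xml:id="grc"/>
    <source xml:id="eng"/>
    <declarations>
      <bitext-relation xml:id="unclear"/>
      <reuse-type xml:id="trans"/>
      <reuse-type xml:id="para"/>
    </declarations>
  </head>
  <body bitext-relation="unclear" reuse-type="trans">
    <align>
      <tok src="grc" ref="1" pos="1"/>
      <tok src="eng" ref="1" pos="1"/>
    </align>
  </body>
</TAN-A-tok>
"""


def check_skeleton(root):
    """Return a list of problems found against the rules stated in the prose."""
    problems = []
    sources = root.findall("./head/source")
    if len(sources) != 2:
        problems.append(f"expected exactly 2 <source>s, found {len(sources)}")
    if any(s.get(XML_ID) is None for s in sources):
        problems.append("every <source> needs an @xml:id")
    body = root.find("./body")
    for name in ("bitext-relation", "reuse-type"):
        declared = {
            e.get(XML_ID) for e in root.findall(f"./head/declarations/{name}")
        }
        refs = (body.get(name) or "").split() if body is not None else []
        if not refs:
            problems.append(f"<body> needs @{name}")
        for ref in refs:
            if ref not in declared:
                problems.append(f"@{name} value {ref!r} is not declared")
    return problems


print(check_skeleton(ET.fromstring(SKELETON)))
```

The check treats every id reference on <body> as something that must resolve to a local declaration. Whether built-in TAN keywords may be referenced without a local declaration is not settled by this sketch.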

<body> has only one type of child: one or more <align>s, each of which collects sets of <tok>s from one or both sources, known collectively as a token cluster. Each token cluster in a given TAN-A-tok file is valid independent of any other token cluster. Clusters may overlap, to handle translations in which words fall in one-to-one, one-to-many, many-to-one, and many-to-many relationships. The independence of token clusters allows you to register differences of opinion about the same set of tokens. An <align> may take an @xml:id, to facilitate external discussions about an assertion.
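
As a rough illustration of overlapping clusters, the following Python sketch indexes a hypothetical <body> by token and lists the tokens that more than one <align> mentions. The attributes src, ref, and pos on <tok>, and all id values, are assumptions made for the example and are not drawn from the text above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment: two clusters that share one Greek and one English token.
BODY = """
<body bitext-relation="unclear" reuse-type="trans">
  <align xml:id="a1">
    <tok src="grc" ref="1" pos="1"/>
    <tok src="eng" ref="1" pos="1"/>
    <tok src="eng" ref="1" pos="2"/>
  </align>
  <align xml:id="a2">
    <tok src="grc" ref="1" pos="1"/>
    <tok src="grc" ref="1" pos="2"/>
    <tok src="eng" ref="1" pos="2"/>
  </align>
</body>
"""


def clusters_by_token(body):
    """Map each (src, ref, pos) token to the ids of the clusters that mention it."""
    index = defaultdict(list)
    for align in body.findall("align"):
        for tok in align.findall("tok"):
            key = (tok.get("src"), tok.get("ref"), tok.get("pos"))
            index[key].append(align.get(XML_ID))
    return index


index = clusters_by_token(ET.fromstring(BODY))
# Tokens claimed by more than one cluster; each cluster remains valid on its own.
shared = sorted(tok for tok, ids in index.items() if len(ids) > 1)
print(shared)
```

Because each cluster is an independent assertion, the overlap is reported rather than treated as an error.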

Nothing should be inferred from silence in a TAN-A-tok file. Unmentioned tokens in either source do not represent gaps in a translation. All that can be inferred is that the creators and editors of the TAN-A-tok file have said nothing about the tokens.

If you wish to declare that one or more words in one source were left out of a translation or inserted into one—that is, words in one source have no match in the other—you must do so through a half-null alignment, i.e., a token cluster that has tokens from only one source. A half-null alignment corresponds—to draw from the terminology of translation studies—to implicitation or explicitation of entire words or phrases.
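
A half-null alignment can be recognized mechanically: every token in the cluster comes from the same source. The Python sketch below assumes, purely for illustration, that each <tok> names its source in a src attribute; that attribute, along with ref, pos, and the id values, is hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment: a2 holds tokens from one source only.
BODY = """
<body bitext-relation="unclear" reuse-type="trans">
  <align xml:id="a1">
    <tok src="grc" ref="1" pos="1"/>
    <tok src="eng" ref="1" pos="1"/>
    <tok src="eng" ref="1" pos="2"/>
  </align>
  <align xml:id="a2">
    <tok src="eng" ref="1" pos="3"/>
  </align>
</body>
"""


def sources_in_cluster(align):
    """Return the set of source ids referenced by the cluster's tokens."""
    return {tok.get("src") for tok in align.findall("tok")}


def is_half_null(align):
    """A cluster is half-null when all of its tokens come from a single source."""
    return len(sources_in_cluster(align)) == 1


body = ET.fromstring(BODY)
half_null = [a.get(XML_ID) for a in body.findall("align") if is_half_null(a)]
print(half_null)
```

Tokens that appear in no cluster at all are not reported, in keeping with the rule that nothing is inferred from silence.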

A fully aligned bitext may result in a TAN-A-tok file with a very long <body> (in contrast to the typical TAN-A-div file). That does not mean, however, that everything in a source must be encoded or described. In writing and editing a TAN-A-tok file you do not commit yourself to saying everything possible about the bitext. You might choose to encode only a few token clusters.

If there are multiple IDs in @reuse-type or @bitext-relation, the intersection, not the union, of those values is to be understood. For example, reuse-type="trans para" would indicate that the token cluster results from both translation and paraphrase. If you wish to claim that the token cluster might be a translation or it might be a paraphrase, then you should create two separate alignments, and add @cert.
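
The difference between a joint claim and alternative claims can be sketched in Python as follows. The sketch makes several assumptions that the text above does not confirm: that an <align> may carry its own @reuse-type overriding the <body> default, that @cert holds a decimal between 0 and 1, and that <tok> takes the hypothetical attributes src, ref, and pos.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. The first cluster makes one joint claim (translation
# AND paraphrase). The last two offer alternative claims about the same tokens,
# each weighted with an assumed @cert value.
BODY = """
<body bitext-relation="unclear" reuse-type="trans">
  <align reuse-type="trans para">
    <tok src="grc" ref="1" pos="4"/>
    <tok src="eng" ref="1" pos="5"/>
  </align>
  <align reuse-type="trans" cert="0.6">
    <tok src="grc" ref="2" pos="1"/>
    <tok src="eng" ref="2" pos="1"/>
  </align>
  <align reuse-type="para" cert="0.4">
    <tok src="grc" ref="2" pos="1"/>
    <tok src="eng" ref="2" pos="1"/>
  </align>
</body>
"""


def claimed_reuse(align, body):
    """Ids that all apply to the cluster together; falls back to the <body> default."""
    value = align.get("reuse-type") or body.get("reuse-type")
    return frozenset(value.split())


body = ET.fromstring(BODY)
claims = [(claimed_reuse(a, body), a.get("cert")) for a in body.findall("align")]
for reuse, cert in claims:
    print(sorted(reuse), cert)
```

Every id in a single value is read as applying at once, so the "either one or the other" reading is only expressible as separate clusters.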