Token-Based Annotations and Alignments (<TAN-A-tok>)

Token-Based Annotations and Alignments (`<TAN-A-tok>`)

TAN-A-tok files facilitate the microscopic alignment of two related sources. The format is intended to allow you to specify exactly where, how, and why two transcriptions align, and to do so on the most granular level possible. TAN-A-tok files also allow you to express levels of confidence or alternative opinions. The two class-1 sources should be two different versions of the same work. Most often, one will be a translation of the other, but the format can also be used for two versions of a text in the same language, e.g., a paraphrase or a revision.

Creators and editors of TAN-A-tok files should be able to read the languages of their sources and to explain as precisely as possible the relationship between the two sources. They should be prepared to think about and specify types of textual reuse. TAN-A-tok files tend to be more demanding to create and edit than are TAN-A files because of the level of detail involved.

To simplify the file, token alignment is restricted to two texts, referred to jointly as a bitext. Each half of the bitext must be a TAN-T(EI) file. It is assumed that those two sources share some special relationship, direct or indirect, and relate through one or more types of textual reuse: translation, paraphrase, commentary, and so forth. Some of these bitexts, such as literal translations, may line up quite nicely word for word. Others, such as paraphrases, may line up sporadically, vaguely, ambiguously, or, in places, not at all. Annotating a bitext is often not easy, and it requires you to consider and declare the assumptions you have made in two key areas: the relationship that holds between the two scripta, and the types of reuse that were involved in turning one version into the other (or a common ancestor into both).

Relationship of sources' scripta. What is the physical relationship or history that connects the two sources' scripta? Is one a direct descendant (copy) of the other? If not, what common ancestor do they share? Here you should consider the material aspect of the bitext, because you are trying to answer how object A's text relates to object B's. See the section called “TAN keywords for types of bitext relations (<bitext-relation>)”.

Types of reuse. What categories of text reuse do you consider operative? Users of your data should be informed of the paradigm you bring to your analysis. You may wish to keep your categories nondescript and somewhat vague, using generic terms such as translation, paraphrase, quotation, without much specificity. On the other hand, you may subscribe to a detailed view of text reuse. Perhaps you have adopted field-specific categories such as obligatory explicitation, optional explicitation, pragmatic explicitation, or translation-inherent explicitation. You may also wish to declare secondary types of reuse that may have intervened, such as scribal omission or dittography. You must declare at least one type of reuse, either your own or one of those built into the TAN format. See the section called “TAN keywords for types of bitext reuse (<reuse-type>)”.

Root Element and Header

The root element of a token-based alignment file is <TAN-A-tok>.

The TAN-A-tok header builds upon the core and class 2 headers (see the section called “Metadata (<head>)” and the section called “Class 2 Metadata (<head>)”).

TAN-A-tok files take exactly two <source>s. The sequence is arbitrary. Each <source> must take an @xml:id.

<vocabulary-key> takes, in addition to all the elements allowed in class-2 files (see the section called “Class 2 Metadata (<head>)”), two elements unique to TAN-A-tok: <bitext-relation> and <reuse-type>. The former describes the genealogical relationship between the two sources' scripta. The latter attends to the qualitative aspect of the bitext relationship. Both are discussed in the introduction to this section, above.
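
The header requirements just described can be pictured with a skeleton like the following. It is only a sketch, not a valid TAN file: the namespace, the root attributes, the remaining core and class-2 metadata, and the children of each element are all omitted, and the order of elements is not asserted. Every identifier (`src-a`, `src-b`, `rel-1`, `reuse-translation`) is a hypothetical placeholder.

```xml
<!-- Sketch only. Required metadata and child elements are omitted;
     all xml:id values are hypothetical placeholders. -->
<TAN-A-tok>
  <head>
    <!-- Exactly two sources, in arbitrary order, each with an @xml:id -->
    <source xml:id="src-a">
      <!-- identification of the first TAN-T(EI) file omitted -->
    </source>
    <source xml:id="src-b">
      <!-- identification of the second TAN-T(EI) file omitted -->
    </source>
    <vocabulary-key>
      <!-- How the two sources' scripta are related -->
      <bitext-relation xml:id="rel-1">
        <!-- description omitted -->
      </bitext-relation>
      <!-- At least one type of reuse must be declared -->
      <reuse-type xml:id="reuse-translation">
        <!-- description omitted -->
      </reuse-type>
    </vocabulary-key>
  </head>
  <body bitext-relation="rel-1" reuse-type="reuse-translation">
    <!-- one or more <align> elements -->
  </body>
</TAN-A-tok>
```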

Data (`<body>`)

The <body> of a TAN-A-tok file takes, in addition to the customary optional attributes (see the section called “Edit Stamp”), the required attributes @bitext-relation and @reuse-type. Each takes one or more IDrefs pointing to the <bitext-relation> and <reuse-type> elements declared in the header, and together they set the default values that govern the alignment.

<body> has only one type of child: one or more <align>s, each of which collects sets of <tok>s from one or both sources, known collectively as a token cluster. Clusters may overlap, to handle translations in which words fall in one-to-one, one-to-many, many-to-one, and many-to-many relationships. The independence of token clusters allows you to register differences of opinion about the same set of tokens. An <align> may take an @xml:id, in case you or someone else wishes to refer to a particular <align>.
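
A <body> holding two token clusters might look roughly like the sketch below. Only @ref is described in this section. The @src and @pos attributes on <tok> are assumptions about how a token names its source and its position within the referenced <div>, and should be checked against the schema. All identifiers and reference values are hypothetical.

```xml
<!-- Sketch only. @src and @pos are assumed attributes; ids and
     reference values are hypothetical placeholders. -->
<body bitext-relation="rel-1" reuse-type="reuse-translation">
  <!-- One-to-one cluster, given an @xml:id so others can refer to it -->
  <align xml:id="align-1">
    <tok src="src-a" ref="1" pos="1"/>
    <tok src="src-b" ref="1" pos="1"/>
  </align>
  <!-- One-to-many cluster: one token in src-a, two in src-b -->
  <align>
    <tok src="src-a" ref="1" pos="2"/>
    <tok src="src-b" ref="1" pos="2"/>
    <tok src="src-b" ref="1" pos="3"/>
  </align>
</body>
```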

Nothing should be inferred from silence in a TAN-A-tok file. There is no requirement that everything in a source must be encoded or described. In writing and editing a TAN-A-tok file you do not commit yourself to saying everything possible about the bitext. You might choose to encode only a few token clusters. Tokens that are not referred to should not be interpreted as gaps in a translation. All that can be inferred is that the creators and editors of the TAN-A-tok file have said nothing about the tokens. (See discussion on comprehensiveness.) In fact it is oftentimes preferable to have a TAN-A-tok file that points to only a selection of tokens; a file with tens of thousands of <align>s could take a very long time to validate.

Any token may be the object of as many <align>s as you like. In fact, this is preferred if you wish to register competing claims or alternatives.
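
As an illustration of competing claims, the sketch below places the same token from one source in two independent clusters, each proposing a different counterpart in the other source. The @src and @pos attributes, and all identifiers and values, are assumptions made for the example.

```xml
<!-- Sketch only. @src and @pos are assumed attributes; ids and
     reference values are hypothetical placeholders. -->
<body bitext-relation="rel-1" reuse-type="reuse-translation">
  <!-- First opinion: token 4 of src-a corresponds to token 5 of src-b -->
  <align xml:id="claim-1">
    <tok src="src-a" ref="2" pos="4"/>
    <tok src="src-b" ref="2" pos="5"/>
  </align>
  <!-- Alternative opinion about the same src-a token -->
  <align xml:id="claim-2">
    <tok src="src-a" ref="2" pos="4"/>
    <tok src="src-b" ref="2" pos="7"/>
  </align>
</body>
```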

If you wish to declare that one or more words in a source were omitted from a translation or inserted into one—that is, words in one source have no match in the other—you must do so through a one-sided alignment, i.e., a token cluster that has tokens from only one source. A one-sided alignment implies insertions or omissions.
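
A one-sided alignment could be sketched as a cluster whose tokens all come from a single source, as below. The @src and @pos attributes, and all identifiers and values, are again assumptions made for the example.

```xml
<!-- Sketch only. @src and @pos are assumed attributes; ids and
     reference values are hypothetical placeholders. -->
<body bitext-relation="rel-1" reuse-type="reuse-translation">
  <!-- Every token is from src-a, so the cluster claims that these
       words have no match in src-b (an insertion or an omission) -->
  <align>
    <tok src="src-a" ref="3" pos="6"/>
    <tok src="src-a" ref="3" pos="7"/>
  </align>
</body>
```

By contrast, tokens that appear in no <align> at all carry no such claim, in keeping with the rule that nothing should be inferred from silence.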

If there are multiple values in @reuse-type or @bitext-relation, the intersection, not the union, of those values is to be understood. For example, reuse-type="translation paraphrase" would indicate that the token cluster results from an activity that is both translation and paraphrase.

Commonly, <tok>s include @ref, pointing to a leaf <div>. But this is not required. The @ref may point to a <div> that takes other <div>s, or @ref may be altogether absent.
