Chapter 6. Class-2 TAN files, annotations of texts

This chapter provides general background to class-2 TAN files. For detailed discussion of individual elements and attributes see Chapter 12, TAN patterns, elements, and attributes defined.

There are three types of class-2 files:

TAN-A files provide broad, macroscopic alignment of multiple versions of any number of works. They also support a wide variety of annotations on texts.
TAN-A-tok files provide narrow, microscopic alignment of any two class-1 files, annotating word-for-word or character-for-character correspondences between the two texts.
TAN-A-lm files express annotations pertaining to lexico-morphology (grammatical part-of-speech), for either a single class-1 file or a language in general.

In translation studies, it is common to use the term source (or sources) to refer to a text that has been translated and the term target to refer to the translation. TAN, however, has been designed for situations where it may not be clear which text is the target and which is the source. Further, there is a more generic use of source and target that prevails in many other contexts. In these guidelines, therefore, the term target never refers to a text as such (rather, it normally refers to a file that is being pointed to), and when we use the word source, we are referring only to one of the class-1 files upon which a class-2 alignment depends.

Common elements

Class 2 metadata (`<head>`)

Class-2 files share a few common features in their metadata, mostly to facilitate the human-friendly reference system discussed below.

All class-2 files have as their sources nothing other than class-1 files. Therefore each <source> must take the pattern described in the section called “Digital entity metadata pattern”.

Editors of class-2 files must be able to name or number word-tokens in a transcription, and to determine an appropriate definition of "token," via an optional <token-definition>. See the section called “Defining words and tokens”.

Inevitably, some class-1 sources for the same work will differ from each other. Perhaps works or div types were not defined with the same IRIs, or perhaps one version follows an idiosyncratic reference system. If sources need to be reconciled, alterations may be specified in <adjustments>, which stipulates a set of actions that should be applied to the sources that have been named. The following adjustment actions are supported:

<skip>, to allow you to ignore specific <div>s, deeply or shallowly.
<rename>, to allow you to rename specific <div>s.
<equate>, to allow you to provisionally establish some @n values as being synonymous.
<reassign>, to allow you to split leaf <div>s and move their parts elsewhere in the structure.

These adjustment actions allow you to reconcile discordant sources without changing them directly.

Skips, renames, and equates are first applied to the source as received. If a particular source <div> is the target of more than one adjustment action, only one is applied: the first according to action priority, namely <skip>, <rename> based on @ref, <rename> based on @n, then <equate>. This action priority also corresponds to the amount of time needed to process the adjustments. Numerous <skip> actions are applied very quickly. Numerous <reassign>s, however, can be time-consuming, because they require tokenizing the text.

Because of this priority order, some actions might not be performed. For example, if you deeply skip a <div>, no renaming adjustments will be made to its children.

Skips, renames, and equates are applied in one pass, based on the original reference system, then <reassign>s are applied to the newly adjusted source. If you rename a <div> and then want to reassign it, you must do so based on the new name, not the original.
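The priority rule and the deep-skip example described above can be sketched as follows. This is a minimal illustration in Python, not how TAN's own validation routines are implemented; the action labels, the function names, and the representation of a reference as a tuple of @n steps are all assumptions made for the example.

```python
# Illustrative only: labels and data shapes are assumptions, not TAN's API.
# The order mirrors the action priority stated in the text.
PRIORITY = ("skip", "rename-by-ref", "rename-by-n", "equate")


def winning_action(requested):
    """Pick the single first-pass action applied to one <div>.

    `requested` is an iterable of labels from PRIORITY. The label that
    comes earliest in PRIORITY wins; None means nothing was requested.
    """
    ranked = sorted(set(requested), key=PRIORITY.index)
    return ranked[0] if ranked else None


def is_deeply_skipped(ref, deep_skips):
    """True if `ref` (a tuple of @n steps) is a deep-skip target or lies beneath one."""
    return any(ref[:len(skipped)] == skipped for skipped in deep_skips)


# A <div> targeted by both a rename and a skip is simply skipped.
print(winning_action(["rename-by-n", "skip"]))
# Deeply skipping div 1 also takes 1.2 out of consideration for renames.
print(is_deeply_skipped(("1", "2"), [("1",)]))
print(is_deeply_skipped(("2", "1"), [("1",)]))
```

<reassign> is deliberately left out of the sketch, because it runs in a later pass against the already adjusted source.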

Each adjustment action adds time to the validation routines. On lengthy texts these can become quite time-consuming. Take, for example, the Tanakh / Old Testament in Hebrew, Greek Septuagint, and English (King James Version). Each of these differs from the others in the names of books and in the numeration of some chapters and verses (primarily in the books of Psalms, Jeremiah, Joel, and Hosea). To completely reconcile these three versions requires at least 1 <skip>, 237 <rename>s, 3 <equate>s, and 31 <reassign>s. Applying these actions to all three versions can take about two minutes (tested on a computer with an Intel i5-8250U and 12 GB of RAM), before any other significant validation checks on anything inside the <body> of the class-2 file.^[17] If such processing times are unacceptable, you are advised to keep adjustment actions to a minimum or to apply them to relatively small texts.

Further, adjustment actions were intended primarily to address common irregularities between files, to apply some last minute touches, or perhaps to drop certain parts of texts. Adjustments were not designed to provide extensive, deep corrections. If a source must be changed in numerous places to reconcile it with other sources, you should create a new version of the source, reorganized as you prefer. Then in both the new and original versions of the class-1 files insert <redivision>, <predecessor>, <successor>, or <see-also> to link the two versions.

There is a TAN application that remodels one text in the image of another. See applications/remodel/remodel text.xsl. The output of that application requires editing, but it can reduce the amount of work required. TAN tools for Oxygen's author mode can also be used to correct that newly segmented text.

Class 2 data (`<body>`)

Data types differ greatly between the class-2 formats. However, they all have one thing in common: the <body> consists of a series of claims, and responsibility for those claims should be attributed to the persons, organizations, or algorithms making the claims. Therefore, each <body> may take @claimant and perhaps @claim-when, specifying by IDref who should be credited with, or blamed for, the material. If either attribute is missing, it is assumed that the claims are the responsibility of the persons listed in <file-resp>. The values of @claimant and @claim-when are weakly inheritable.

Class 2 pointer syntax: referencing texts

The class-2 formats have been designed to be human readable, particularly in their text references. In ordinary conversation, when referring to specific parts of a work, we prefer to use the numbers or names of pages, paragraphs, sentences, lines, words, letters, and so forth, and sometimes relational words (e.g., "first"). We might say, for example, "See page 4, second paragraph, the last four words." Sometimes we quote the very text itself: "See page 4, second paragraph, first sentence, second occurrence of 'pull'."

Those familiar conventions are the basis for the TAN pointer syntax, which differs from other pointer systems (e.g., URLs, XPath, and XPointer). TAN pointers apply common reference terminology to four strata of a text: works, divisions, word tokens, and characters. Works, defined above (see the section called “One Work”), are declared by the source (which may not have more than one work). Divisions are defined by the <div> structure of each source. Tokens are words of the text in those divisions, defined according to one or more <token-definition>s declared in the class-2 file. And characters are defined as individual base letters in a word token (any modifier character is treated in concert with the last preceding base character; see the section called “Combining characters”).

This approach not only makes the syntax human readable but mitigates the effect of changes to the sources. For example, if a <div> is deleted, moved, or changed, the alteration affects only references specific to that <div> and its descendants; the rest of the reference system remains intact.

The four parts of TAN's reference system are explained below, but you should consult other parts of the guidelines, or study TAN examples, to see how they are used in practice.

Referencing works: `@work`

This section applies only to TAN-A files, because the other class-2 files do not make claims about works per se.

TAN-A files refer to works via meaningful IDrefs that point to the class-1 sources that transcribe the work/work-version, e.g., work="hamlet". The reference is understood to apply not merely to that particular source, but to any TAN-T file that claims to transcribe that work or work-version. (On the relationship between works and work-versions see the section called “Domain model”.) Thus, the id of the source-scriptum becomes a proxy or alias for the work.

Any work may also be defined through a vocabulary item <work>, either locally in the <vocabulary-key> or in a TAN-voc file linked via <vocabulary>. The work would then be referred to by @xml:id or <name> of the particular vocabulary item.

Referencing textual divisions: `@ref`

Portions of text, i.e., <div>s, perhaps altered if <adjustments> have been invoked (see the section called “Metadata (<head>)”), are pointed to via @ref. A @ref is constructed by taking the values of @n in the <div> in question along with its ancestor <div>s, and joining them with non-word characters. For example, @ref="I.1.1" might point to the following:

<div type="act" n="1">
   <div type="scene" n="1">
      <div type="line" n="1">
         . . . . . .
      </div>
      . . . . . .
   </div>
   . . . . . .
</div>

A @ref can express sequences and ranges of <div>s. In the example ref="1.2-4, 1.5", the hyphen and the comma are reserved characters that signify ranges and series, respectively. A hyphen always means "from...through" and a comma always means "and". In the TAN format, commas are always paratactic, not hypotactic. For example, if referring to Hamlet, ref="I,2,3" is not a single reference to one <div> (act I, scene 2, line 3) but rather a reference to three of them: act I, act 2, and act 3 (notice how the commas in the attribute value behave like the commas in the written phrase). If you mean to say act I, scene 2, line 3, try ref="I.2.3" or ref="I 2 3".
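To make the difference between the comma, the hyphen, and the other separators concrete, here is a minimal Python sketch of how a @ref-like string could be broken apart. It is an illustration only, not TAN's parser: the function name parse_ref and the returned structure are assumptions, and the sketch does not attempt to resolve how the end of a range relates to its start.

```python
import re


def parse_ref(ref):
    """Split a @ref-like string into a list of (start, end) pairs.

    Each start or end is a list of @n steps; end is None when the item
    is a single reference rather than a range.
    """
    items = []
    for part in ref.split(","):          # comma: "and" (a series)
        ends = part.split("-")           # hyphen: "from...through" (a range)
        if len(ends) > 2:
            raise ValueError(f"more than one hyphen in {part!r}")
        # Any other run of non-word characters separates hierarchical steps.
        steps = [[s for s in re.split(r"\W+", e) if s] for e in ends]
        if not all(steps):
            raise ValueError(f"empty reference in {part!r}")
        items.append((steps[0], steps[1] if len(steps) == 2 else None))
    return items


print(parse_ref("I,2,3"))       # three separate one-step references
print(parse_ref("I.2.3"))       # one three-step reference
print(parse_ref("1.2-4, 1.5"))  # a range followed by a single reference
```

Because the steps are split on any run of non-word characters, "I.2.3" and "I 2 3" yield the same structure in this sketch.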

The periods (full stops) in @ref="I.1.1" are hypotactic markers, but they are arbitrary, and could be replaced with any mix of non-word characters you like (except the hyphen or comma), including spaces, e.g., ref="I:1 1". The numeral system is also arbitrary. You may use any supported numeration system (see the section on numeration systems), even if the source uses a different one. Semantic equivalents to the preceding example are ref="A I i" and ref="1:a:I". Just remember, if you use either the Roman numeral system or alphabetic sequences, include a <numerals> in the <head> to specify which system should prevail in case of ambiguities (e.g., whether c means 3 or 100). Roman numerals are the default, but it is a good idea to be explicit.
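The ambiguity between Roman and alphabetic numerals can be illustrated with a short Python sketch. Everything here is an assumption made for the example: the function names, the prefer parameter (a stand-in for whatever <numerals> declares), and the restriction of alphabetic numerals to single ASCII letters. It is not a description of how TAN itself converts numerals.

```python
import re

# Standard subtractive Roman numerals; an empty string is rejected separately.
ROMAN = re.compile(r"M*(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_value(step):
    """Integer value of a Roman numeral, or None if it is not one."""
    s = step.upper()
    if not s or not ROMAN.fullmatch(s):
        return None
    total = 0
    for ch, nxt in zip(s, s[1:] + " "):
        v = VALUES[ch]
        # A smaller numeral before a larger one is subtracted (IV = 4).
        total += -v if VALUES.get(nxt, 0) > v else v
    return total


def alphabetic_value(step):
    """a=1 ... z=26. Single ASCII letters only, a simplification."""
    s = step.lower()
    return ord(s) - ord("a") + 1 if len(s) == 1 and "a" <= s <= "z" else None


def step_value(step, prefer="roman"):
    """Interpret one @ref step as an integer, trying the preferred system first."""
    if step.isdecimal():
        return int(step)
    r, a = roman_value(step), alphabetic_value(step)
    first, second = (r, a) if prefer == "roman" else (a, r)
    return first if first is not None else second


print(step_value("c"))                                # read as a Roman numeral
print(step_value("c", prefer="alphabetic"))           # read as the third letter
print([step_value(s) for s in "A I i".split()])       # each step resolves to 1
```

A step such as "a" is not a valid Roman numeral, so the sketch falls back to the alphabetic reading even when Roman numerals are preferred.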

Referencing tokens: `@pos` and `@val`

To point to a token one normally uses <tok>, with one or more attributes, in three possible configurations:

@val or @rgx alone: one or more tokens are pointed to by value. For example, val = "bird" points to every occurrence of the token bird; rgx = "b.+d" finds every word that begins with a b, ends with a d, and has some characters in between. Every value of @rgx is implicitly bound to the beginning and end of the string (see below).
@pos alone: one or more tokens are pointed to by numerical position, via one or more digits, or the phrase last or last- plus a digit, joined by hyphens or commas. For example, 2, 4-6, last-2 - last refers to the second, fourth, fifth, sixth, antepenult, penult, and final tokens in a passage. The numerical value to which the keyword last resolves depends upon the context length.
@val or @rgx combined with @pos: a combination of the previous two methods. For example, @val="bird" @pos="2, 4" picks the second and fourth occurrences of the token bird.

During Schematron validation, if @pos is missing, it is assumed to mean * or 1 - last; if neither @val nor @rgx appear, the assumption is @rgx with value .+ (any characters). That is, by default, @pos points to every instance and @val/@rgx to every token.

When using @pos make sure you know the context. For example, the attribute combination val="bird" pos="last-1" will produce an error if the token bird does not occur at least two times in the given context.
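The interplay of @pos, @val, and the context can be sketched in Python as follows. This is a simplified illustration, not TAN's validation logic: the function names are hypothetical, tokens are assumed to be already split into a list, and the sketch assumes that ranges must run in ascending order.

```python
import re

# One term is a number, "last", or "last-N"; an item is a term or "term - term".
TERM = r"(last(?:-\d+)?|\d+)"
ITEM = re.compile(rf"\s*{TERM}\s*(?:-\s*{TERM}\s*)?")


def _term(term, last):
    """Resolve a single term against the context length `last`."""
    if term.startswith("last"):
        return last - int(term[5:] or 0)   # "last-2" -> last - 2
    return int(term)


def resolve_pos(pos, last):
    """Expand a @pos-like string into 1-based positions within a context."""
    positions = []
    for item in pos.split(","):
        m = ITEM.fullmatch(item)
        if not m:
            raise ValueError(f"cannot parse {item!r}")
        start = _term(m.group(1), last)
        end = _term(m.group(2), last) if m.group(2) else start
        if not 1 <= start <= end <= last:  # assumption: ascending ranges only
            raise ValueError(f"{item.strip()!r} falls outside 1..{last}")
        positions.extend(range(start, end + 1))
    return positions


def select_tokens(tokens, val=None, pos=None):
    """1-based indexes of tokens equal to val (default: any) at pos (default: all)."""
    hits = [i for i, t in enumerate(tokens, 1) if val is None or t == val]
    # When a value is given, positions count occurrences of it, not all tokens.
    chosen = resolve_pos(pos if pos is not None else "1 - last", len(hits))
    return [hits[p - 1] for p in chosen]


print(resolve_pos("2, 4-6, last-2 - last", 10))
words = "the bird saw a bird and another bird".split()
print(select_tokens(words, val="bird", pos="2, last"))
```

With only one occurrence of bird in the context, select_tokens(["one", "bird"], val="bird", pos="last-1") raises a ValueError, which mirrors the error case described above.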

It is advisable to use @val or perhaps @rgx, and not merely @pos. If your source's text changes, and there is no @val, it may be difficult to determine the original intent of a claim, and therefore whether changes need to be made. @val is easier than @rgx to process in applications, particularly when compiling statistics or estimating probabilities. Furthermore, @val is generally more efficient to process than @rgx. A @rgx is more efficient only if it replaces numerous instances of @val.

@rgx is a regular expression that must match an entire word-token. For example, @rgx="re.d" will match the tokens "rend" and "read" but will not match "already", "rends", or "bread". If you wish to allow for characters at the beginning or end, use ".*re.d.*". For more on regular expressions, see the section called “Regular expressions”.
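The whole-token rule can be tried out with a few lines of Python. Python's re module is only a stand-in here, since TAN's own regular expression flavor may differ in details; the helper name matches_rgx is an assumption.

```python
import re


def matches_rgx(rgx, token):
    """True only if the pattern matches the entire token, as @rgx requires."""
    return re.fullmatch(rgx, token) is not None


tokens = ["rend", "read", "already", "rends", "bread"]

# Bound to both ends of the token: only the four-letter words qualify.
print([t for t in tokens if matches_rgx("re.d", t)])

# Explicitly allowing characters before and after matches all five tokens.
print([t for t in tokens if matches_rgx(".*re.d.*", t)])
```

Using a full match rather than a search is what makes "re.d" reject "already", even though the letters "read" occur inside it.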

Referencing characters: `@chars`

Individual letters are always specified by @chars, which points to a specific position, e.g., chars="2, 7, last". Combining characters are excluded from these counts; see the section called “Combining characters”.
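As a rough illustration of counting base characters only, the following Python sketch groups each base character with the combining marks that follow it and then picks positions from the result. It relies on the canonical combining class reported by the standard unicodedata module, which is an assumption and may not coincide exactly with TAN's definition of combining characters; the function names are likewise hypothetical.

```python
import unicodedata


def base_characters(token):
    """Group each base character with any combining marks that follow it."""
    clusters = []
    for ch in token:
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # a new base character
    return clusters


def pick_chars(token, positions):
    """Return the clusters at 1-based positions; "last" means the final one."""
    clusters = base_characters(token)
    return [clusters[(len(clusters) if p == "last" else p) - 1] for p in positions]


# "résumé" written with decomposed accents: 8 code points, 6 base characters.
word = "re\u0301sume\u0301"
print(len(word), len(base_characters(word)))
print(pick_chars(word, [2, "last"]))
```

In this example positions 2 and last both return an e together with its acute accent, because the accent does not count as a character of its own.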

^[17]In earlier generations of TAN, this process took upwards of an hour.
