Chapter 6. Class-2 TAN Files, Annotations of Texts


This chapter provides general background to class-2 TAN files. For detailed discussion of individual elements and attributes see Chapter 9, TAN patterns, elements, and attributes defined.

There are three types of class-2 files:

TAN-A files provide broad, macroscopic alignment of multiple versions of any number of works. They also allow annotations of texts, in the form of claims.
TAN-A-tok files provide narrow, microscopic alignment of any two class-1 files, annotating word-for-word or character-for-character correspondences between the two texts.
TAN-A-lm files express annotations pertaining to lexico-morphology (part-of-speech), for either a single class-1 file or a language in general.

In translation studies, it is common to use the term source (or sources) to refer to a translated text and the term target to refer to the translation. TAN, however, has been designed for situations where it may not be clear which text is the target and which is the source. Further, there is a more generic use of source and target that prevails in many other contexts. In these guidelines, therefore, the term target never refers to a text as such (rather, it normally refers to a file that is being pointed to), and when we use the word source, we are referring only to one of the class-1 files upon which a class 2 alignment depends.

Common Elements

Class 2 Metadata (`<head>`)

Class-2 files share a few common features in their metadata, mostly to facilitate the human-friendly reference system discussed below.

All class-2 files have as their sources nothing other than class-1 files. Therefore each <source> must follow the pattern described in the section called “Digital Entity Metadata Pattern”.

Editors of class-2 files must be able to name or number word-tokens in a transcription, and to determine an appropriate definition of "token," via an optional <token-definition>. See the section called “Defining Words and Tokens”.

The declaration <numerals> at present does not allow you to customize a numeration system for sources. A future release of TAN may support such a feature.

Inevitably, some class 1 sources for the same work will differ from each other. Perhaps works or div types were not defined with the same IRIs, or perhaps one version follows an idiosyncratic reference system. If sources need to be reconciled, alterations may be specified in <adjustments>, which stipulates a set of actions that should be applied to the sources that have been named. Adjustment actions:

<skip>, to allow you to ignore specific <div>s, deeply or shallowly.
<rename>, to allow you to rename specific <div>s.
<equate>, to allow you to provide synonyms for @n values.
<reassign>, to allow you to split leaf <div>s and move their parts elsewhere in the structure.

These adjustment actions allow you to reconcile discordant sources without changing them directly.

Skips, renames, and equates are first applied to the source as received. If a particular source <div> is the target of more than one adjustment action, only the first one will be applied according to action priority: <skip>, <rename> based on @ref, <rename> based on @n, then <equate>. Because of this priority order, some actions might not be performed. For example, if you deeply skip a <div>, no renaming adjustments will be made to its children. If you have renamed a div, then want to reassign it, you must do so based on the new name, not the original. You should be aware of the consequences of your adjustments.
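
The priority rule above can be sketched in a few lines of Python. This is only an illustration of the stated order, not the TAN validation code; the action labels and the function name pick_action are hypothetical.

```python
# Priority order as stated in the text; the string labels are hypothetical.
PRIORITY = ["skip", "rename-by-ref", "rename-by-n", "equate"]

def pick_action(candidates):
    """Return the single action applied to a div, or None if none targets it.

    `candidates` lists the kinds of adjustment action that all target
    the same source div.
    """
    if not candidates:
        return None
    # The action whose kind comes first in PRIORITY wins; the rest are ignored.
    return min(candidates, key=PRIORITY.index)

print(pick_action(["equate", "rename-by-n", "skip"]))  # skip
```

In this sketch a div targeted by both an equate and a rename is only renamed, which is why some of the actions you write might never be performed.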

After skips, renames, and equates are applied, <reassign>s are applied to the newly adjusted source.

Each adjustment action adds time to the validation routines. On lengthy texts these can become quite time-consuming. Take, for example, the Tanakh / Old Testament in Hebrew, Greek Septuagint, and English (King James Version). Each of these differs from the others in the names of books and in the numeration of some chapters and verses (primarily the books of Psalms, Jeremiah, Joel, and Hosea). To reconcile these three versions, one might write 267 <rename>s and 6 <equate>s. Applying these actions to all three versions can take about a minute (tested on a computer with an Intel i5-8250U and 12 GB of RAM), before any other significant validation checks on the <body> of the class-2 file. Normal validation takes about a minute and a half. If such processing times are unacceptable for your needs, you are advised to keep <adjustments> to a minimum or to apply them to relatively small texts.

Further, adjustment actions were intended primarily to support the alignment process, and so were designed to apply select changes to sources. If a source must be changed in numerous places to reconcile it with other sources, it might be better to create a new version of the source organized according to the target reference system. Then in both the new and original versions of the class-1 files insert <redivision>, <predecessor>, <successor>, or <see-also> to link the two versions.

There is a TAN application that remodels one text in the image of another. See applications/remodel/remodel via TAN-T.xsl. The output of that application requires editing, but it can reduce the amount of work required. TAN tools for oXygen's author mode can also be used to correct that newly segmented text. These and related applications are under development, and may not function as expected. Improvement of these tools is scheduled for future releases of TAN.

Class 2 Data (`<body>`)

Data differs greatly between the class 2 formats. However, they all share one thing in common: the <body> consists of a series of claims, and responsibility for those claims should be attributed to the persons, organizations, or algorithms making the claims. Therefore, each <body> may take @claimant, specifying by IDref who should be credited with or blamed for the material, and perhaps @claim-when. If either attribute is missing, it is assumed that the claims are the responsibility of the persons listed in <file-resp> at the time of the latest date or date-time. The values of @claimant and @claim-when are weakly inheritable.

Class 2 Pointer Syntax: Referencing Texts

The class 2 formats have been designed to be human readable, particularly text references. In ordinary conversation, when referring to specific parts of a work, we prefer to use the numbers or names of pages, paragraphs, sentences, lines, words, letters, and so forth, and sometimes relational words (e.g., "first"). We might say, for example, "See page 4, second paragraph, the last four words." Sometimes we quote the very text itself: "See page 4, second paragraph, first sentence, second occurrence of 'pull'."

Those familiar conventions are the basis for the TAN pointer syntax, which differs from other pointer systems (e.g., URLs, XPath, and XPointer). TAN pointers apply common reference terminology to four strata of a text: works, divisions, word tokens, and characters. Works, defined above (see the section called “One Work”), are declared by the source (which may not have more than one work). Divisions are defined by the <div> structure of each source. Tokens are words of the text in those divisions, defined according to one or more <token-definition>s declared in the class-2 file. And characters are defined as individual base letters in a word token (modifier characters are grouped with the preceding base character; see the section called “Combining characters”).

This approach not only makes the syntax human readable but mitigates the effect of changes to the sources. For example, if a <div> is deleted, moved, or changed, the alteration affects only references specific to that section; the rest of the reference system remains intact.

The four parts of TAN's reference system are explained below, but you should consult other parts of the guidelines, or TAN examples, to see how they are used in practice.

Referencing Works: `@work`

Class-2 files refer to works via meaningful IDrefs that point to the class-1 sources that transcribe the work or work-version, e.g., work="hamlet". The reference is understood to apply not merely to that particular source, but to any TAN-T file that claims to transcribe that work or work-version. (On the relationship between works and work-versions see the section called “Domain model”.) Thus, the id of the source-scriptum becomes a proxy or alias for the work. A vocabulary item <work> may also be used; its @xml:id provides a way to refer to a work without requiring a corresponding source.

Because TAN-A-tok and TAN-A-lm files deal with source-specific claims, the data for those formats do not refer to works. Only TAN-A <claim>s refer to works.

Referencing Divisions: `@ref`

Portions of text, i.e., <div>s, perhaps altered if <adjustments> have been invoked (see the section called “Metadata (<head>)”), are pointed to via @ref. A @ref is constructed by taking the values of @n in the <div> in question along with its ancestor <div>s, and joining them with non-word characters. For example, @ref="I.1.1" might point to the following:

<div type="act" n="1">
   <div type="scene" n="1">
      <div type="line" n="1">
         . . . . . .
      </div>
      . . . . . .
   </div>
   . . . . . .
</div>
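
The construction just described (collect @n from a <div> and its ancestors, then join the values) can be sketched in Python as follows. This is a minimal illustration, not part of TAN: the function name refs_for_leaves is hypothetical, the sample omits the namespace that real TAN files declare, and "sample text" is a placeholder.

```python
import xml.etree.ElementTree as ET

# Namespace-free stand-in for a class-1 <body> fragment (an assumption;
# real TAN-T files are namespaced).
SAMPLE = """
<div type="act" n="1">
   <div type="scene" n="1">
      <div type="line" n="1">sample text</div>
   </div>
</div>
"""

def refs_for_leaves(root, sep="."):
    """Yield (ref, text) for every leaf div, joining ancestor @n values."""
    def walk(div, trail):
        trail = trail + [div.get("n")]
        children = div.findall("div")
        if not children:
            # A leaf div: its ref is the chain of @n values from the top down.
            yield sep.join(trail), (div.text or "").strip()
        for child in children:
            yield from walk(child, trail)
    yield from walk(root, [])

print(list(refs_for_leaves(ET.fromstring(SAMPLE))))
# [('1.1.1', 'sample text')]
```

The separator is a parameter because, in a @ref, any non-word character other than the hyphen or comma may join the steps.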

A @ref can express sequences and ranges of <div>s. In the example ref="1.2-4, 1.5", the hyphen and comma are reserved characters that signify ranges and series. A hyphen always means "from...through" and a comma always means "and". Take note if you are accustomed to editing conventions that use the comma as a subordinating punctuation mark: in the TAN format, commas are always paratactic, not hypotactic. For example, if referring to Hamlet, ref="I,2,3" is not a single reference to one <div> (act I, scene 2, line 3) but rather a reference to three of them: act 1, act 2, and act 3 (notice how the commas in the attribute value behave like the commas in the written phrase).
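
To make the roles of the comma and hyphen concrete, here is a hedged Python sketch that splits a @ref value first on commas (series), then on hyphens (ranges), then on any other non-word characters (steps). The function name parse_ref is hypothetical, and the sketch deliberately leaves each range end exactly as written rather than guessing how a shortened end such as "4" relates to its start.

```python
import re

def parse_ref(ref):
    """Split a @ref value into a list of (start, end) pairs of step tuples."""
    items = []
    for part in ref.split(","):                      # comma: "and"
        ends = [p.strip() for p in part.split("-")]  # hyphen: "from...through"
        if len(ends) > 2 or not all(ends):
            raise ValueError(f"malformed ref item: {part!r}")
        # Any remaining run of non-word characters separates hierarchical steps.
        steps = [tuple(s for s in re.split(r"\W+", e) if s) for e in ends]
        if not all(steps):
            raise ValueError(f"ref item has no steps: {part!r}")
        items.append((steps[0], steps[-1]))          # a single ref is its own end
    return items

print(parse_ref("1.2-4, 1.5"))
# [(('1', '2'), ('4',)), (('1', '5'), ('1', '5'))]
print(parse_ref("I,2,3"))
# [(('I',), ('I',)), (('2',), ('2',)), (('3',), ('3',))]
```

The second call shows the paratactic comma at work: three separate one-step references, not one three-step reference.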

The periods (full stops) in @ref="I.1.1" are hypotactic markers, but they are arbitrary, and could be replaced with any mix of non-word characters you like (except the hyphen or comma), including spaces, e.g., ref="I:1 1". The numeral system is also arbitrary. You may use any supported numeral system (see the section on numeration systems), even if the source uses a different one. Semantic equivalents to the preceding example are ref="A I i" and ref="1:a:I". Just remember, if you use either the Roman numeral system or alphabetic sequences, include a <numerals> in the <head> to specify which system should prevail in case of ambiguities (e.g., whether c means 3 or 100). Roman numerals are the default, but it is a good idea to be explicit.
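
The ambiguity between Roman and alphabetic numerals can be illustrated with a short Python sketch. It is not TAN's own numeral handling: the function names are hypothetical, the alphabetic branch covers single letters a through z only, and the prefer parameter merely stands in for the choice that <numerals> expresses.

```python
import re

ROMAN_RE = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})", re.I)
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

def roman_to_int(token):
    """Return the value of a well-formed Roman numeral, else None."""
    if not token or not ROMAN_RE.fullmatch(token):
        return None
    values = [ROMAN_VALUES[ch] for ch in token.lower()]
    total = 0
    for i, v in enumerate(values):
        # A smaller value before a larger one is subtracted (e.g., iv = 4).
        total += -v if i + 1 < len(values) and v < values[i + 1] else v
    return total

def alpha_to_int(token):
    """Return 1..26 for a single ASCII letter, else None (a simplification)."""
    if len(token) == 1 and token.isascii() and token.isalpha():
        return ord(token.lower()) - ord("a") + 1
    return None

def step_to_int(token, prefer="roman"):
    """Normalize one step of a ref to an integer, or None if unrecognized."""
    if token.isascii() and token.isdigit():
        return int(token)
    roman, alpha = roman_to_int(token), alpha_to_int(token)
    if roman is not None and alpha is not None:   # ambiguous, e.g. "c"
        return roman if prefer == "roman" else alpha
    return roman if roman is not None else alpha

print([step_to_int(s) for s in "A I i".split()])    # [1, 1, 1]
print(step_to_int("c"), step_to_int("c", "alphabetic"))  # 100 3
```

Under these assumptions "A I i" and "1:a:I" both normalize to the same three steps as "I.1.1", while "c" changes value depending on which system prevails.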

Referencing Tokens: `@pos` and `@val`

To point to a token one normally uses <tok>, with one or more attributes, in three possible configurations:

@val or @rgx alone: one or more tokens are pointed to by value. For example, val="bird" points to every occurrence of the token bird; rgx="b.+d" finds every word that begins with a b, ends with a d, and has some characters in between. Every value of @rgx is implicitly bound to the beginning and end of the string (see below).
@pos alone: one or more tokens are pointed to by numerical position, via one or more digits, or the phrase last or last- plus a digit, joined by hyphens or commas. For example, 2, 4-6, last-2 - last refers to the second, fourth, fifth, sixth, antepenult, penult, and final tokens in a passage. The numerical value to which the keyword last resolves depends upon the context length.
@val or @rgx combined with @pos: a combination of the previous two methods. For example, @val="bird" @pos="2, 4" picks the second and fourth occurrences of the token bird.

During Schematron validation, if @pos is missing, it is assumed to mean * or 1 - last; if neither @val nor @rgx appears, the assumption is @rgx with value .+ (any characters). That is, @pos by default points to every instance and @val/@rgx by default points to any string.

When using @pos make sure you know the context. For example, the attribute combination val="bird" pos="last-1" will produce an error if the token bird does not occur at least two times in the given context.
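
The interplay of @val, @pos, and context can be sketched in Python. This is an illustration under stated assumptions, not the TAN implementation: the function names are hypothetical, tokens are produced by a plain whitespace split rather than a <token-definition>, and a term such as last-2 must be written without internal spaces.

```python
import re

TERM = r"(?:last(?:-\d+)?|\d+)"
ITEM = re.compile(rf"\s*({TERM})(?:\s*-\s*({TERM}))?\s*")

def resolve_term(term, count):
    """Turn '3', 'last', or 'last-2' into a 1-based position within `count`."""
    if term.startswith("last"):
        value = count - (int(term[5:]) if len(term) > 4 else 0)
    else:
        value = int(term)
    if not 1 <= value <= count:
        raise ValueError(f"{term!r} resolves to {value}, outside 1..{count}")
    return value

def resolve_pos(pos, count):
    """Expand a @pos value into a list of 1-based positions."""
    positions = []
    for item in pos.split(","):
        m = ITEM.fullmatch(item)
        if m is None:
            raise ValueError(f"malformed pos item: {item!r}")
        start = resolve_term(m.group(1), count)
        end = resolve_term(m.group(2), count) if m.group(2) else start
        if end < start:
            raise ValueError(f"range runs backwards: {item!r}")
        positions.extend(range(start, end + 1))
    return positions

def select_tokens(tokens, val=None, pos="1 - last"):
    """Return 1-based token positions picked by an optional val plus pos."""
    candidates = [i for i, t in enumerate(tokens, 1) if val is None or t == val]
    return [candidates[p - 1] for p in resolve_pos(pos, len(candidates))]

print(resolve_pos("2, 4-6, last-2 - last", 10))  # [2, 4, 5, 6, 8, 9, 10]
tokens = "the bird saw a bird near the bird".split()
print(select_tokens(tokens, val="bird", pos="2, last"))  # [5, 8]
```

With only one occurrence of bird in the context, select_tokens(["a", "bird"], val="bird", pos="last-1") raises ValueError in this sketch, mirroring the error described above.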

It is advisable to use @val, and not merely @pos. If your source's text changes, and there is no @val, it may be difficult to determine the original intent of a claim, or to determine whether changes need to be made. Furthermore, @val is generally speaking more efficient to process than is @rgx. A @rgx is more efficient to process only if it replaces numerous instances of @val.

@rgx is a regular expression that must match an entire word-token. For example, @rgx="re.d" will match the tokens "rend" and "read" but will not match "already", "rends", or "bread". If you wish to allow for characters at the beginning or end, use ".*re.d.*". For more on regular expressions, see the section called “Regular Expressions”.
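
The whole-token binding can be demonstrated with a short Python sketch. Python's re module is only a stand-in here; the regular expression flavor that TAN itself uses is described in the section referenced above and may differ in details. The function name matches_rgx is hypothetical.

```python
import re

def matches_rgx(rgx, token):
    """True if the pattern matches the entire token, not just a part of it."""
    return re.fullmatch(rgx, token) is not None

TOKENS = ["rend", "read", "already", "rends", "bread"]

print([t for t in TOKENS if matches_rgx("re.d", t)])
# ['rend', 'read']
print([t for t in TOKENS if matches_rgx(".*re.d.*", t)])
# ['rend', 'read', 'already', 'rends', 'bread']
```

Using fullmatch rather than search is what reproduces the implicit binding to the beginning and end of the token.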

Referencing Characters: `@chars`

Individual letters are always specified by @chars, which points to a specific position, e.g., chars="2, 7, last". Combining characters are excluded from these counts; see the section called “Combining characters”.
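
A rough Python sketch of character counting follows. It groups each base character with the combining characters after it, using the Unicode canonical combining class as an approximation of "modifier character"; TAN's own definition is given in the section on combining characters and may be broader. The function names are hypothetical.

```python
import unicodedata

def base_clusters(token):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in token:
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return clusters

def pick_chars(token, positions):
    """Pick clusters by 1-based position; the string 'last' means the final one."""
    clusters = base_clusters(token)
    picked = []
    for p in positions:
        index = len(clusters) if p == "last" else p
        if not 1 <= index <= len(clusters):
            raise IndexError(f"position {p!r} is outside 1..{len(clusters)}")
        picked.append(clusters[index - 1])
    return picked

word = "re\u0301sume\u0301"          # "résumé" with decomposed accents
print(len(word), len(base_clusters(word)))  # 8 6
print(pick_chars(word, [2, "last"]) == ["e\u0301", "e\u0301"])  # True
```

The word has eight code points but only six countable characters, so position 2 is the accented e rather than the bare accent.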
