This chapter provides general background to class 2 TAN files. For detailed discussion of individual elements and attributes see Chapter 8, TAN patterns, elements, and attributes defined.
TAN-A-div files provide broad, macroscopic alignment of multiple versions of any number of works. They also provide a place for annotating the texts through general claims.
TAN-A-tok files provide narrow, microscopic alignment of any two class 1 files, identifying word-for-word or character-for-character correspondence.
TAN-A-lm files support lexico-morphology (part-of-speech) for either a single class 1 file or a language.
In translation studies, it is common to use the term source (or sources) to refer to the text being translated and the term target to refer to the translation. TAN, however, has been designed for cases where it may not be clear which is the target and which is the source. Further, there is a more generic use of source that takes precedence. In these guidelines, therefore, we avoid the term target altogether, and when we use the word source, we are referring only to one of the class 1 files upon which a class 2 alignment depends.
The class 2 formats have been designed to be human readable, particularly in their references to class 1 files. In ordinary conversation, when referring to specific parts of a work, we like to cite pages, paragraphs, sentences, lines, words, letters, and so forth. We use relational words (e.g., "first"), and the very text itself. We might say, for example, "See page 4, second paragraph, the last four words." Or, "See page 4, second paragraph, first sentence, second occurrence of 'pull'."
Those familiar conventions are the basis for the TAN pointer syntax, which therefore differs from other pointer systems (e.g., URLs, XPath, and XPointer). TAN pointers depend upon a fourfold hierarchy: works, divisions, word tokens, and characters. Works (see the section called “One work”) are determined by the source, which may not have more than one work. Divisions are defined by the <div> structure of each source. Tokens are the words of those divisions, defined according to one or more tokenization rules. Characters are defined as non-modifying codepoints in a word token. (A modifying character is always included with the base character it modifies.)
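The character rule above can be sketched in Python. This is only an illustration, not the TAN validation code: it assumes that "modifying character" means any code point in a Unicode Mark category (Mn, Mc, Me), and the function names `tan_characters` and `pick_chars` are hypothetical.

```python
import unicodedata


def tan_characters(token: str) -> list[str]:
    """Split a token into base characters, each carrying its modifiers."""
    units: list[str] = []
    for ch in token:
        # Assumption: a "modifying character" is any Unicode Mark (M*).
        is_modifier = unicodedata.category(ch).startswith("M")
        if is_modifier and units:
            units[-1] += ch  # attach the modifier to the preceding base
        else:
            units.append(ch)
    return units


def pick_chars(token: str, positions: list[int]) -> list[str]:
    """Return the characters at the given 1-based positions."""
    units = tan_characters(token)
    return [units[p - 1] for p in positions]


# A decomposed "résumé": eight code points, but six countable characters.
word = "re\u0301sume\u0301"
print(len(word), len(tan_characters(word)))
```

Under this assumption, a value such as chars = "2, 6" applied to the word above would pick both accented letters, each together with its combining accent.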
Parts of this fourfold hierarchy (works, divisions, tokens, and characters) normally have familiar names. Sources can be given a meaningful abbreviated name (e.g., xml:id = "hamlet-1741"); divisions are named according to @n; tokens are referred to by position, by their actual values, or both (e.g., pos = "1 - 5", pos = "last-1 - last", val = "hath"; see the section called “@pos and @val”). Characters are always identified by number (e.g., chars = "2, 7").
This approach not only makes the syntax human readable but also mitigates disruptions from corrections to the dependencies. For example, if an incorrectly duplicated <div> is deleted, disruption to the reference system is isolated and does not affect the rest of the document.
Metadata (<head>)
Class 2 files share a few common features in their metadata, mostly to facilitate the human-friendly reference system outlined above.
All class 2 files have as their sources nothing other than class 1 files. Therefore each <source> must follow the pattern described in the section called “Digital Entity Metadata Pattern”.
Editors of class 2 files must be able to name or number word-tokens in a transcription, via an optional <token-definition>. See the section called “Defining Words and Tokens”.
Inevitably, some class 1 sources will have differences. Perhaps works or div types were not defined with the same IRIs, or perhaps one version follows an idiosyncratic reference system. If sources need to be reconciled, alterations are specified in <alter>, which stipulates a set of actions that should be applied to the sources that have been named. Alteration actions include:
These actions allow you to reconcile sources that are somewhat at odds. Actions are applied first hierarchically and then in the sequence stated above. That is, the validation routine goes level by level through a given source. Any rules that apply at one level are applied (skips taking top precedence, reassigns the lowest) before the routine moves to the next level of the source. So if, in a given source, you change chapter 1 to chapter 2, any subdivisions will be collated. If you then want to do further things with (original) 1.5, you need to refer to it as 2.5, and you need to bear in mind that if an original 2.5 exists, the action will be applied to both.
Each action adds time to the validation routines, which on lengthy texts can become quite time-consuming. You are advised to keep <alter>s to a minimum. If a source has numerous alterations, you may find it less time-consuming to create a new version of the source.
<body>
The three types of class 2 files treat different kinds of phenomena, so their data structures look quite different. Nevertheless, a few elements and attributes are shared by at least two class 2 formats.
Many class 2 elements take @src and @ref. @src points via ID reference to one or more <source>s, and @ref points to one or more <div>s through their flat ref, perhaps substituted with their new values if <alter>s have been invoked (see the section called “Metadata (<head>)”).
In the example ref = "1.2-4, 1.5", the periods are arbitrary (but the hyphen and comma, which have special meanings here, are not). You may use any separating punctuation or space you wish, except for hyphens and commas, which are reserved to create ranges and joins. You may also use other numeral systems.
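The separator rules for @ref can be illustrated with a short Python sketch. It is not the TAN implementation; the function name `parse_ref` is hypothetical, and the sketch deliberately leaves the short end of a range (the 4 in 1.2-4) uninterpreted, because how that end is completed is not specified here.

```python
import re


def parse_ref(ref: str) -> list[list[list[str]]]:
    """Break a @ref value into joined items; each item has one or two ends.

    Every end is returned as its list of steps (e.g. ["1", "2"]).
    """
    items = []
    for part in ref.split(","):      # commas join separate references
        ends = part.split("-")       # a hyphen separates the ends of a range
        if len(ends) > 2:
            raise ValueError(f"more than one hyphen in {part!r}")
        # Any other punctuation or space merely separates the steps.
        steps = [re.findall(r"[^\W_]+", end) for end in ends]
        if any(not s for s in steps):
            raise ValueError(f"empty reference in {part!r}")
        items.append(steps)
    return items


# Periods, spaces, and colons are interchangeable as step separators.
print(parse_ref("1.2-4, 1.5"))
print(parse_ref("1 2-4, 1:5"))
```

Both calls produce the same structure: a range whose ends are 1, 2 and 4, joined with the single reference 1, 5.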
To point to a token, one of three methods may be used.
@pos alone. Under this method, one or more digits, or the keyword last, or last- plus a digit, joined by hyphens or commas, indicate one or more token numbers. For example, 2, 4-6, last-2 - last refers to the second, fourth, fifth, sixth, antepenultimate, penultimate, and final tokens in a passage. The numerical value to which the keyword last resolves depends upon the length of each <div>.
@val alone. Under this method, a single token is picked by means of a string value equivalent to the token. For example, @val = "bird" points to the first occurrence of the token bird.
@pos and @val together. Under this method, specific occurrences of a token are picked. For example, @val="bird" @pos="2, 4" picks the second and fourth occurrences of the token bird.
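The @pos syntax described above can be resolved against a known token count as in the following Python sketch. It is an illustration only, not the TAN validator: the names `resolve_pos` and `_resolve` are hypothetical, and it assumes that last-N is written without internal spaces so that it can be told apart from a range hyphen.

```python
import re

# A term is a number, "last", or "last-N" (written without spaces).
_TERM = r"(last(?:-\d+)?|\d+)"
# An item is one term, or two terms joined by a range hyphen.
_ITEM = re.compile(rf"\s*{_TERM}(?:\s*-\s*{_TERM})?\s*")


def _resolve(term: str, count: int) -> int:
    """Turn one term into a 1-based token number."""
    if term.startswith("last"):
        offset = int(term[5:]) if len(term) > 4 else 0
        number = count - offset
    else:
        number = int(term)
    if not 1 <= number <= count:
        raise ValueError(f"{term!r} is out of range for {count} tokens")
    return number


def resolve_pos(pos: str, count: int) -> list[int]:
    """Expand a @pos value into token numbers for a div of `count` tokens."""
    numbers: list[int] = []
    for item in pos.split(","):
        match = _ITEM.fullmatch(item)
        if match is None:
            raise ValueError(f"cannot parse {item!r}")
        start = _resolve(match.group(1), count)
        end = _resolve(match.group(2), count) if match.group(2) else start
        if start > end:
            raise ValueError(f"descending range in {item!r}")
        numbers.extend(range(start, end + 1))
    return numbers


print(resolve_pos("2, 4-6, last-2 - last", 10))
```

For a division of ten tokens, the example resolves to tokens 2, 4, 5, 6, 8, 9, and 10. The same expression applied to a shorter or longer division gives different numbers for the last three, which is why last depends on the length of each <div>.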
During validation, if @pos or @val is missing, it is supplied with its default value, 1 or .+ respectively. That is, @pos by default points to the first instance and @val by default points to any string.
@pos and @val must be used carefully. For example, the attribute combination val="bird" pos="last-5" will produce an error if the word token bird does not occur at least six times.
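How the defaults and the error case interact can be sketched in Python. This is a simplified illustration, not the TAN validator: the function name `pick_token` is hypothetical, @val is assumed to be matched as a regular expression against the whole token (as the default .+ suggests), and only a single @pos term is handled, without ranges or commas.

```python
import re


def pick_token(tokens: list[str], val: str = ".+", pos: str = "1") -> int:
    """Return the 0-based index of the token selected by val and pos."""
    # Assumption: val is a regular expression that must match the whole token.
    matches = [i for i, tok in enumerate(tokens) if re.fullmatch(val, tok)]
    pos = pos.strip()
    if pos.startswith("last"):
        offset = int(pos[5:]) if len(pos) > 4 else 0
        number = len(matches) - offset
    else:
        number = int(pos)
    # pos counts occurrences among the matching tokens, not all tokens.
    if not 1 <= number <= len(matches):
        raise ValueError(
            f"pos={pos!r} needs more than the {len(matches)} "
            f"tokens that match val={val!r}"
        )
    return matches[number - 1]


tokens = "the bird saw a bird near the bird".split()
print(pick_token(tokens, val="bird"))           # default pos: first occurrence
print(pick_token(tokens, val="bird", pos="2"))  # second occurrence
print(pick_token(tokens, pos="last"))           # default val: any token
```

With only three occurrences of bird in this sample, `pick_token(tokens, val="bird", pos="last-5")` raises a ValueError, mirroring the error described above.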
It is advisable to use @val, and not merely @pos. If an editor makes corrections to your source texts, references that have no @val are more likely to become corrupt and less likely to be traceable.