This chapter provides general background to class-2 TAN files. For detailed discussion of individual elements and attributes see Chapter 12, TAN patterns, elements, and attributes defined.
There are three types of class-2 files:
TAN-A files provide broad, macroscopic alignment of multiple versions of any number of works. They also support a wide variety of annotations on texts.
TAN-A-tok files provide narrow, microscopic alignment of any two class-1 files, annotating word-for-word or character-for-character correspondences between the two texts.
TAN-A-lm files express annotations pertaining to lexico-morphology (grammatical part-of-speech), for either a single class-1 file or a language in general.
In translation studies, it is common to use the term source (or sources) to refer to the text being translated and the term target to refer to the translation. TAN, however, has been designed for situations where it may not be clear which text is the target and which is the source. Further, there is a more generic use of source and target that prevails in many other contexts. In these guidelines, therefore, the term target never refers to a text as such (rather, it normally refers to a file that is being pointed to), and when we use the word source, we are referring only to one of the class-1 files upon which a class-2 alignment depends.
<head>
Class-2 files share a few common features in their metadata, mostly to facilitate the human-friendly reference system discussed below.
All class-2 files have as their sources nothing other than class-1 files.
Therefore each <source> must take the pattern described in the section called
“Digital entity metadata pattern”.
Editors of class-2 files must be able to name or number word-tokens in a
transcription, and to determine an appropriate definition of "token," via an
optional <token-definition>. See the section called “Defining words and tokens”.
Inevitably, some class-1 sources for the same work will differ from each other.
Perhaps works or div types were not defined with the same IRIs, or perhaps one
version follows an idiosyncratic reference system. If sources need to be
reconciled, alterations may be specified in <adjustments>, which stipulates a set
of actions that should be applied to the sources that have been named. The
following adjustment actions are supported: <skip>, <rename>, <equate>, and
<reassign>.
These adjustment actions allow you to reconcile discordant sources without changing them directly.
Skips, renames, and equates are first applied to the source as received. If a
particular source <div> is the target of more than one adjustment action, only the
first one will be applied, according to action priority: <skip>, <rename> based on
@ref, <rename> based on @n, then <equate>. This action priority also corresponds
to the amount of time needed to process the adjustments. Numerous <skip> actions
are applied very quickly. Numerous <reassign>s, however, can be time-consuming,
because they require tokenizing the text.
Because of this priority order, some actions might not be performed. For
example, if you deeply skip a <div>, no renaming adjustments will be made to its
children.
Skips, renames, and equates are applied in one pass, based on the original
reference system, then <reassign>s are applied to the newly adjusted source. If
you rename a div, then want to reassign it, you must do so based on the new name,
not the original.
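The first-pass priority rule can be sketched in Python. This is only an illustration of the ordering described above, not TAN's validation code: the tuple encoding of actions and the function name first_pass_action are assumptions made for the example, and the later <reassign> pass is not modeled.

```python
# Priority of first-pass adjustment actions, as described in the text.
# Each action is encoded (hypothetically) as (element name, attribute it is based on).
PRIORITY = {
    ("skip", None): 0,
    ("rename", "ref"): 1,
    ("rename", "n"): 2,
    ("equate", None): 3,
}


def first_pass_action(actions):
    """Return the one action that would be applied to a <div>, or None.

    `actions` lists every first-pass action that targets the same <div>.
    Only the highest-priority action is kept; the rest are ignored.
    """
    candidates = [action for action in actions if action in PRIORITY]
    if not candidates:
        return None
    return min(candidates, key=PRIORITY.get)


# A <div> targeted by an equate, a rename by @n, and a skip is simply skipped.
print(first_pass_action([("equate", None), ("rename", "n"), ("skip", None)]))
```

Under this ordering a rename based on @ref always wins over a rename based on @n when both target the same <div>.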
Each adjustment action adds time to the validation routines. On lengthy texts
these can become quite time-consuming. Take, for example, the Tanakh / Old
Testament in Hebrew, Greek Septuagint, and English (King James Version). Each of
these differs from the others in the names of books and in the numeration of some
chapters and verses (primarily in the books of Psalms, Jeremiah, Joel, and Hosea).
To completely reconcile these three versions requires at least 1 <skip>, 237
<rename>s, 3 <equate>s, and 31 <reassign>s. Applying these actions to all three
versions can take about two minutes (tested on a computer with an Intel i5-8250U
and 12 GB RAM), before any other significant validation checks are run on anything
inside the <body> of the class-2 file.[17] If such processing times are
unacceptable, you are advised to keep <adjustments> to a minimum or to apply them
to relatively small texts.
Further, adjustment actions were intended primarily to address common
irregularities between files, to apply some last-minute touches, or perhaps to
drop certain parts of texts. Adjustments were not designed to provide extensive,
deep corrections. If a source must be changed in numerous places to reconcile it
with other sources, you should create a new version of the source, reorganized as
you prefer. Then in both the new and original versions of the class-1 files insert
<redivision>, <predecessor>, <successor>, or <see-also> to link the two versions.
There is a TAN application that remodels one text in the image of another. See
applications/remodel/remodel text.xsl. The output of that application requires
editing, but it can reduce the amount of work required. TAN tools for Oxygen's
author mode can also be used to correct that newly segmented text.
<body>
Data types differ greatly between the class-2 formats. However, they all share
one thing in common: the <body> consists of a series of claims, and responsibility
for those claims should be attributed to the persons, organizations, or algorithms
making the claims. Therefore, each <body> may take @claimant and perhaps
@claim-when, specifying by IDref who should be credited with or blamed for the
material. If either attribute is missing, it is assumed that the claims are the
responsibility of the persons listed in <file-resp>. The values of @claimant and
@claim-when are weakly inheritable.
The class-2 formats have been designed to be human readable, particularly text references. In ordinary conversation, when referring to specific parts of a work, we prefer to use the numbers or names of pages, paragraphs, sentences, lines, words, letters, and so forth, and sometimes relational words (e.g., "first"). We might say, for example, "See page 4, second paragraph, the last four words." Sometimes we quote the very text itself: "See page 4, second paragraph, first sentence, second occurrence of 'pull'."
Those familiar conventions are the basis for the TAN pointer syntax, which
differs from other pointer systems (e.g., URLs, XPath, and XPointer). TAN pointers
apply common reference terminology to four strata of a text: works, divisions,
word tokens, and characters. Works, defined above (see the section called “One
Work”), are declared by the source (which may not have more than one work).
Divisions are defined by the <div> structure of each source. Tokens are words of
the text in those divisions, defined according to one or more <token-definition>s
declared in the class-2 file. And characters are defined as individual base
letters in a word token (any modifier character is treated in concert with the
last preceding base character; see the section called “Combining characters”).
This approach not only makes the syntax human readable but mitigates the effect
of changes to the sources. For example, if a <div> is deleted, moved, or changed,
the alteration affects only references specific to that <div> and its descendants;
the rest of the reference system remains intact.
The four parts of TAN's reference system are explained below, but you should consult other parts of the guidelines, or study TAN examples, to see how they are used in practice.
@work
This section applies only to TAN-A files, because the other class-2 files do not make claims about works per se.
TAN-A files refer to works via meaningful IDrefs that point to the class-1
sources that transcribe the work/work-version, e.g., work="hamlet". The reference
is understood to apply not merely to that particular source, but to any TAN-T file
that claims to transcribe that work or work-version. (On the relationship
between works and work-versions see the section called “Domain model”.) Thus, the
id of the source-scriptum becomes a proxy or alias for the work.
Any work may also be defined through a vocabulary item <work>, either locally in
the <vocabulary-key> or in a TAN-voc file linked via <vocabulary>. The work would
then be referred to by the @xml:id or <name> of the particular vocabulary item.
@ref
Portions of text, i.e., <div>s, perhaps altered if <adjustments> have been
invoked (see the section called “Metadata (<head>)”), are pointed to via @ref. A
@ref is constructed by taking the values of @n in the <div> in question along with
those of its ancestor <div>s, and joining them with non-word characters. For
example, @ref="I.1.1" might point to the following:
<div type="act" n="1">
   <div type="scene" n="1">
      <div type="line" n="1">
         . . . . . .
      </div>
      . . . . . .
   </div>
   . . . . . .
</div>
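The construction of a @ref from nested @n values can be sketched in Python. This is an illustration only, not part of TAN: the function name div_refs and the sample document are assumptions, the sample ignores any namespace a real file may declare, and the period is just one possible separator.

```python
import xml.etree.ElementTree as ET

# Hypothetical, un-namespaced sample modeled on the act/scene/line example.
SAMPLE = """
<body>
  <div type="act" n="1">
    <div type="scene" n="1">
      <div type="line" n="1">first line</div>
      <div type="line" n="2">second line</div>
    </div>
  </div>
</body>
""".strip()


def div_refs(parent, prefix=(), sep="."):
    """Yield one reference string per <div> below `parent`, depth first.

    Each reference joins the @n values of a <div> and all of its
    ancestor <div>s with the separator `sep`.
    """
    for div in parent.findall("div"):
        steps = prefix + (div.get("n"),)
        yield sep.join(steps)
        yield from div_refs(div, steps, sep)


refs = list(div_refs(ET.fromstring(SAMPLE)))
print(refs)  # ['1', '1.1', '1.1.1', '1.1.2']
```

The sketch emits the @n values exactly as written in the source; it does not convert between numeration systems.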
A @ref can express sequences and ranges of <div>s. In the example
ref="1.2-4, 1.5", the hyphen and comma are reserved characters that signify ranges
and series, respectively. A hyphen always means "from...through" and a comma
always means "and". In the TAN format, commas are always paratactic, not
hypotactic. For example, if referring to Hamlet, ref="I,2,3" is not a reference to
a single <div>, act I scene 2 line 3, but rather to three of them: act I, act 2,
and act 3 (notice how the commas in the attribute value behave like the commas in
the written phrase). If you mean to say act I, scene 2, line 3, try ref="I.2.3" or
ref="I 2 3".
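The roles of the comma, the hyphen, and the other non-word characters can be made concrete with a small Python sketch. It only splits a @ref value into its parts; it does not resolve a range against a source, and it makes no claim about how TAN itself expands the end of a range. The names split_steps and parse_ref are hypothetical.

```python
import re


def split_steps(part):
    """Split one reference into its hierarchical steps on non-word characters."""
    return [step for step in re.split(r"\W+", part.strip()) if step]


def parse_ref(ref):
    """Parse a @ref value into a list of (start_steps, end_steps) pairs.

    Commas separate independent items; a hyphen inside an item marks a
    "from...through" range. `end_steps` is None when the item is not a range.
    """
    items = []
    for item in ref.split(","):
        start, _, end = item.partition("-")
        items.append((split_steps(start), split_steps(end) if end.strip() else None))
    return items


print(parse_ref("1.2-4, 1.5"))  # [(['1', '2'], ['4']), (['1', '5'], None)]
print(parse_ref("I,2,3"))       # three separate one-step references
print(parse_ref("I 2 3"))       # one three-step reference
```

Because the comma is handled before any other separator, ref="I,2,3" can never be read as a single three-level reference.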
The periods (full stops) in @ref="I.1.1" are hypotactic markers, but they are
arbitrary, and could be replaced with any mix of non-word characters you like
(except the hyphen or comma), including spaces, e.g., ref="I:1 1". The numeral
system is also arbitrary. You may use any supported numeration system (see the
section on numeration systems), even if the source uses a different one. Semantic
equivalents to the preceding example are ref="A I i" and ref="1:a:I". Just
remember, if you use either the Roman numeral system or alphabetic sequences,
include a <numerals> in the <head> to specify which system should prevail in case
of ambiguities (e.g., whether c means 3 or 100). Roman numerals are the default,
but it is a good idea to be explicit.
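The ambiguity between Roman numerals and alphabetic sequences can be illustrated with a Python sketch. It is a simplified model, not TAN's numeral handling: the function names and the prefer parameter are assumptions, only single-letter alphabetic values are handled, and the Roman reader does not reject malformed numerals.

```python
import re

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(text):
    """Read `text` as a Roman numeral, or return None if it uses other letters."""
    text = text.lower()
    if not text or any(ch not in ROMAN for ch in text):
        return None
    total = 0
    for i, ch in enumerate(text):
        value = ROMAN[ch]
        # A smaller value before a larger one is subtracted (e.g., iv = 4).
        if i + 1 < len(text) and ROMAN[text[i + 1]] > value:
            total -= value
        else:
            total += value
    return total


def alpha_to_int(text):
    """Read a single letter as a = 1 ... z = 26, or return None."""
    text = text.lower()
    if len(text) == 1 and "a" <= text <= "z":
        return ord(text) - ord("a") + 1
    return None


def resolve_n(value, prefer="roman"):
    """Convert one reference step to an integer, trying `prefer` first."""
    if value.isdecimal():
        return int(value)
    readers = [roman_to_int, alpha_to_int]
    if prefer == "alphabetic":
        readers.reverse()
    for read in readers:
        number = read(value)
        if number is not None:
            return number
    raise ValueError(f"cannot interpret {value!r} as a numeral")


def normalize_ref(ref, prefer="roman"):
    """Turn a single (non-range) reference into a list of integers."""
    return [resolve_n(step, prefer) for step in re.split(r"\W+", ref.strip()) if step]


print(normalize_ref("I.1.1"), normalize_ref("A I i"), normalize_ref("1:a:I"))
print(resolve_n("c"), resolve_n("c", prefer="alphabetic"))  # 100 3
```

With Roman numerals preferred, all three spellings of the reference normalize to the same three integers, while the lone letter c changes value depending on which system prevails.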
To point to a token one normally uses <tok>, with one or more attributes, in
three possible configurations:
@val or @rgx alone: one or more tokens are pointed to by value. For example,
val="bird" points to every occurrence of the token bird; rgx="b.+d" finds every
word that begins with a b, ends with a d, and has some characters in between.
Every value of @rgx is implicitly bound to the beginning and end of the string
(see below).
@pos alone: one or more tokens are pointed to by numerical position, via one or
more digits, or the phrase last or last- plus a digit, joined by hyphens or
commas. For example, "2, 4-6, last-2 - last" refers to the second, fourth, fifth,
sixth, antepenult, penult, and final tokens in a passage. The numerical value to
which the keyword last resolves depends upon the context length.
@val or @rgx combined with @pos: a combination of the previous two methods. For
example, @val="bird" @pos="2, 4" picks the second and fourth occurrences of the
token bird.
During Schematron validation, if @pos is missing, it is assumed to mean * or
1 - last; if neither @val nor @rgx appears, the assumption is @rgx with value .+
(any characters). That is, by default, @pos points to every instance and
@val/@rgx to every token.
When using @pos make sure you know the context. For example, the attribute
combination val="bird" pos="last-1" will produce an error if the token bird does
not occur at least two times in the given context.
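How @pos, @val, and @rgx interact can be sketched in Python. This is a rough model of the behavior described above, not the Schematron validation itself: the function names are hypothetical, tokens are assumed to be already split, Python's regular-expression dialect stands in for whatever dialect TAN uses, and a missing pos is treated as "every instance".

```python
import re

TERM = r"(?:last(?:-\d+)?|\d+)"
ITEM = re.compile(rf"\s*({TERM})(?:\s*-\s*({TERM}))?\s*")


def resolve_term(term, length):
    """Turn '3', 'last', or 'last-2' into a 1-based position within `length`."""
    if term.startswith("last"):
        offset = int(term[5:]) if len(term) > 4 else 0
        value = length - offset
    else:
        value = int(term)
    if not 1 <= value <= length:
        raise ValueError(f"position {term!r} is out of range for {length} item(s)")
    return value


def resolve_pos(pos, length):
    """Expand a @pos value such as '2, 4-6, last-2 - last' into positions."""
    positions = []
    for item in pos.split(","):
        match = ITEM.fullmatch(item)
        if match is None:
            raise ValueError(f"cannot parse {item!r}")
        first = resolve_term(match.group(1), length)
        last = resolve_term(match.group(2), length) if match.group(2) else first
        positions.extend(range(first, last + 1))
    return positions


def select_tokens(tokens, val=None, rgx=None, pos=None):
    """Return the 1-based positions in `tokens` picked by val/rgx and pos."""
    if val is None and rgx is None:
        rgx = ".+"  # default: every token
    if val is not None:
        hits = [i for i, tok in enumerate(tokens, 1) if tok == val]
    else:
        hits = [i for i, tok in enumerate(tokens, 1) if re.fullmatch(rgx, tok)]
    if pos is None:  # default: every instance
        return hits
    return [hits[p - 1] for p in resolve_pos(pos, len(hits))]


tokens = "a bird saw a bird chase a bird".split()
print(resolve_pos("2, 4-6, last-2 - last", 10))       # [2, 4, 5, 6, 8, 9, 10]
print(select_tokens(tokens, val="bird", pos="2, last"))  # [5, 8]
```

Note that @pos counts within the matches when a value is given, so val="bird" with pos="last-1" raises an error in this sketch whenever bird occurs fewer than two times.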
It is advisable to use @val or perhaps @rgx, and not merely @pos. If your
source's text changes and there is no @val, it may be difficult to determine the
original intent of a claim, and therefore whether changes need to be made. @val is
easier than @rgx to process in applications, particularly when compiling
statistics or estimating probabilities. Furthermore, @val is, generally speaking,
more efficient to process than @rgx. A @rgx is more efficient only if it replaces
numerous instances of @val.
@rgx is a regular expression that must match an entire word-token. For example,
@rgx="re.d" will match the tokens "rend" and "read" but will not match "already",
"rends", or "bread". If you wish to allow for characters at the beginning or end,
use ".*re.d.*". For more on regular expressions, see the section called “Regular
expressions”.
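The whole-token rule behaves like a full match rather than a search. The following Python sketch reproduces the example with re.fullmatch; Python's regular-expression dialect is used here only for illustration, and the helper name rgx_matches is hypothetical.

```python
import re


def rgx_matches(rgx, tokens):
    """Return the tokens that the pattern matches in their entirety."""
    return [tok for tok in tokens if re.fullmatch(rgx, tok)]


tokens = ["rend", "read", "already", "rends", "bread"]
print(rgx_matches("re.d", tokens))      # ['rend', 'read']
print(rgx_matches(".*re.d.*", tokens))  # all five tokens
```

A plain search for the same pattern would also find "already", "rends", and "bread", which is exactly what the implicit binding to the start and end of the token prevents.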
@chars
Individual letters are always specified by @chars, which points to a specific
position, e.g., chars="2, 7, last". Combining characters are excluded from these
counts; see the section called “Combining characters”.
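Counting base characters while keeping combining characters attached can be sketched in Python. This is an approximation under stated assumptions: it treats any code point with a nonzero Unicode canonical combining class as a combining character, which may not coincide exactly with TAN's own definition, and the function name char_clusters is hypothetical.

```python
import unicodedata


def char_clusters(token):
    """Group each base character with the combining characters that follow it."""
    clusters = []
    for ch in token:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach to the last preceding base character
        else:
            clusters.append(ch)
    return clusters


token = "re\u0301sume\u0301"  # "résumé" written with combining acute accents
clusters = char_clusters(token)
print(len(token), len(clusters))  # 8 6
```

In this model a position such as chars="2" would select the second cluster, that is, the e together with its accent, rather than the second code point alone.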