This chapter provides general background to class-2 TAN files. For detailed discussion of individual elements and attributes see Chapter 9, TAN patterns, elements, and attributes defined.
There are three types of class-2 files:
TAN-A files provide broad, macroscopic alignment of multiple versions of any number of works. They also allow annotation of texts, in the form of claims.
TAN-A-tok files provide narrow, microscopic alignment of any two class-1 files, annotating word-for-word or character-for-character correspondences between the two texts.
TAN-A-lm files express annotations pertaining to lexico-morphology (part-of-speech), for either a single class-1 file or a language in general.
In translation studies, it is common to use the term source (or sources) to refer to the text being translated and the term target to refer to the translation. TAN, however, has been designed for situations where it may not be clear which text is the target and which is the source. Further, there is a more generic use of source and target that prevails in many other contexts. In these guidelines, therefore, the term target never refers to a text as such (rather, it normally refers to a file that is being pointed to), and when we use the word source, we are referring only to one of the class-1 files upon which a class-2 alignment depends.
Metadata (<head>)
Class-2 files share a few common features in their metadata, mostly to facilitate the human-friendly reference system discussed below.
All class-2 files have as their sources nothing other than class-1 files. Therefore each <source> must follow the pattern described in the section called “Digital Entity Metadata Pattern”.
Editors of class-2 files must be able to name or number word-tokens in a transcription, and to determine an appropriate definition of "token," via an optional <token-definition>. See the section called “Defining Words and Tokens”.
The declaration <numerals> at present does not allow you to customize a numeration system for sources. A future release of TAN may support such a feature.
Inevitably, some class-1 sources for the same work will differ from each other. Perhaps works or div types were not defined with the same IRIs, or perhaps one version follows an idiosyncratic reference system. If sources need to be reconciled, alterations may be specified in <adjustments>, which stipulates a set of actions that should be applied to the sources that have been named. The adjustment actions discussed here are <skip>, <rename>, <equate>, and <reassign>.
These adjustment actions allow you to reconcile discordant sources without changing them directly.
Skips, renames, and equates are first applied to the source as received. If a particular source <div> is the target of more than one adjustment action, only the first one will be applied according to action priority: <skip>, <rename> based on @ref, <rename> based on @n, then <equate>. Because of this priority order, some actions might not be performed. For example, if you deeply skip a <div>, no renaming adjustments will be made to its children. If you have renamed a div, then want to reassign it, you must do so based on the new name, not the original. You should be aware of the consequences of your adjustments.
After skips, renames, and equates are applied, <reassign>s are applied to the newly adjusted source.
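The priority rule above can be pictured with a small Python sketch. It is only an illustration of the ordering described in this section, not TAN's validation code: the action labels and the function name pick_action are hypothetical, and deep skips and <reassign> are left out.

```python
# Hypothetical labels for the four prioritized actions, in the order
# given in the text: skip, rename by @ref, rename by @n, then equate.
PRIORITY = ["skip", "rename-by-ref", "rename-by-n", "equate"]


def pick_action(actions):
    """Return the one action that wins for a single <div>, or None.

    `actions` is a list of labels drawn from PRIORITY. Only the
    highest-priority action is kept; the others are not performed.
    """
    if not actions:
        return None
    return min(actions, key=PRIORITY.index)


# A <div> targeted by three actions: only the skip is applied.
winner = pick_action(["equate", "rename-by-n", "skip"])
```

Under these assumptions, a <div> that is both renamed by @n and equated would only be renamed.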
Each adjustment action adds time to the validation routines. On lengthy texts these can become quite time-consuming. Take, for example, the Tanakh / Old Testament in Hebrew, Greek Septuagint, and English (King James Version). Each of these differs from the others in the names of books and in the numeration of some chapters and verses (primarily in the books of Psalms, Jeremiah, Joel, and Hosea). To reconcile these three versions, one might write 267 <rename>s and 6 <equate>s. Applying these actions to all three versions can take about a minute (tested on a computer with an Intel i5-8250U and 12 GB RAM), before any other significant validation checks on the <body> of the class-2 file. Normal validation takes about a minute and a half. If such processing times are unacceptable for your needs, you are advised to keep adjustments to a minimum or to apply them to relatively small texts.
Further, adjustment actions were intended primarily to support the alignment process, and so were designed to apply select changes to sources. If a source must be changed in numerous places to reconcile it with other sources, it might be better to create a new version of the source organized according to the target reference system. Then in both the new and original versions of the class-1 files insert <redivision>, <predecessor>, <successor>, or <see-also> to link the two versions.
There is a TAN application that remodels one text in the image of another. See applications/remodel/remodel via TAN-T.xsl. The output of that application requires editing, but it can reduce the amount of work required. TAN tools for oXygen's author mode can also be used to correct that newly segmented text. These and related applications are under development, and may not function as expected. Improvement of these tools is scheduled for future releases of TAN.
Data (<body>)
Data differs greatly between the class-2 formats. However, they all share one thing in common: the <body> consists of a series of claims, and responsibility for those claims should be attributed to the persons, organizations, or algorithms making the claims.
Therefore, each <body> may take @claimant and perhaps @claim-when, specifying by IDref who should be credited or blamed with the material. If either attribute is missing, it is assumed that the claims are the responsibility of the persons listed in <file-resp> at the time of the latest date or date-time.
The values of @claimant and @claim-when are weakly inheritable.
The class-2 formats have been designed to be human readable, particularly text references. In ordinary conversation, when referring to specific parts of a work, we prefer to use the numbers or names of pages, paragraphs, sentences, lines, words, letters, and so forth, and sometimes relational words (e.g., "first"). We might say, for example, "See page 4, second paragraph, the last four words." Sometimes we quote the very text itself: "See page 4, second paragraph, first sentence, second occurrence of 'pull'."
Those familiar conventions are the basis for the TAN pointer syntax, which differs from other pointer systems (e.g., URLs, XPath, and XPointer). TAN pointers apply common reference terminology to four strata of a text: works, divisions, word tokens, and characters. Works, defined above (see the section called “One Work”), are declared by the source (which may not have more than one work). Divisions are defined by the <div> structure of each source. Tokens are words of the text in those divisions, defined according to one or more <token-definition>s declared in the class-2 file. And characters are defined as individual base letters in a word token (modifier characters are grouped with the preceding base character; see the section called “Combining characters”).
This approach not only makes the syntax human readable but mitigates the effect of changes to the sources. For example, if a <div> is deleted, moved, or changed, the alteration affects only references specific to that section; the rest of the reference system remains intact.
The four parts of TAN's reference system are explained below, but you should consult other parts of the guidelines, or TAN examples, to see how they are used in practice.
@work
Class-2 files refer to works via meaningful IDrefs that point to the class-1 sources that transcribe the work or work-version, e.g., work="hamlet". The reference is understood to apply not merely to that particular source, but to any TAN-T file that claims to transcribe that work or work-version. (On the relationship between works and work-versions see the section called “Domain model”.) Thus, the id of the source-scriptum becomes a proxy or alias for the work. A vocabulary item <work> may also be used; its @xml:id provides a way to refer to a work without requiring a corresponding source.
Because TAN-A-tok and TAN-A-lm files deal with source-specific claims, the data for those formats do not refer to works. Only TAN-A <claim>s refer to works.
@ref
Portions of text, i.e., <div>s, perhaps altered if <adjustments>s have been invoked (see the section called “Metadata (<head>)”), are pointed to via @ref. A @ref is constructed by taking the values of @n in the <div> in question along with its ancestor <div>s, and joining them with non-word characters. For example, @ref="I.1.1" might point to the following:
<div type="act" n="1">
   <div type="scene" n="1">
      <div type="line" n="1">
         . . . . . .
      </div>
      . . . . . .
   </div>
   . . . . . .
</div>
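As a rough illustration of how a @ref is built from nested @n values, the following Python sketch walks a simplified sample and joins each <div>'s @n with those of its ancestors. The sample markup, the function name div_refs, and the choice of a period as the joining character are assumptions made for this example; real TAN files are namespaced and are not processed this way by TAN itself.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free sample modeled on the structure above.
SAMPLE = (
    '<body><div type="act" n="1"><div type="scene" n="1">'
    '<div type="line" n="1">text</div>'
    '<div type="line" n="2">text</div>'
    '</div></div></body>'
)


def div_refs(parent, prefix=()):
    """List a reference for every <div> under `parent`, in document order.

    Each reference joins the @n of the <div> with the @n values of its
    ancestor <div>s, using a period as the (arbitrary) separator.
    """
    refs = []
    for div in parent.findall("div"):
        path = prefix + (div.get("n"),)
        refs.append(".".join(path))
        refs.extend(div_refs(div, path))
    return refs


refs = div_refs(ET.fromstring(SAMPLE))
```

Here `refs` contains one entry per <div>, from the act down to each line.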
A @ref can express sequences and ranges of <div>s. In the example ref="1.2-4, 1.5", the hyphen and comma are reserved characters that signify a range and a series, respectively. A hyphen always means "from...through" and a comma always means "and".
Take note if you are accustomed to editing conventions that use the comma as a subordinating punctuation mark. In the TAN format, commas are always paratactic, not hypotactic. For example, if referring to Hamlet, ref="I,2,3" is not a reference to a single <div>, act I scene 2 line 3, but rather to three of them: act I, act 2, and act 3 (notice how the commas in the attribute value behave like the commas in the written phrase).
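The roles of the comma and hyphen can be sketched in Python as follows. This is a minimal illustration under the assumption that a @ref value is first split on commas into a series and that each item is then split on a hyphen into the two ends of a range; the function name split_ref is hypothetical, and the sketch does not try to resolve how the end of a range inherits context from its start.

```python
def split_ref(ref):
    """Split a @ref value into (start, end) pairs.

    Commas separate the items of a series ("and"); a hyphen inside an
    item separates the two ends of a range ("from...through"). `end`
    is None when the item is a single reference.
    """
    items = []
    for part in ref.split(","):
        part = part.strip()
        if not part:
            continue
        start, hyphen, end = part.partition("-")
        items.append((start.strip(), end.strip() if hyphen else None))
    return items


series = split_ref("1.2-4, 1.5")
hamlet = split_ref("I,2,3")
```

With this reading, ref="I,2,3" yields three separate items rather than one three-level reference.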
The periods (full stops) in @ref="I.1.1" are hypotactic markers, but they are arbitrary, and could be replaced with any mix of non-word characters you like (except the hyphen or comma), including spaces, e.g., ref="I:1 1". The numeral system is also arbitrary. You may use any supported numeral system (see section on numeration systems), even if the source uses a different one. Semantic equivalents to the preceding example are ref="A I i" and ref="1:a:I". Just remember, if you use either the Roman numeral system or alphabetic sequences, include a <numerals> in the <head> to specify which system should prevail in case of ambiguities (e.g., whether c means 3 or 100). Roman numerals are the default, but it is a good idea to be explicit.
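The equivalence of these references can be made concrete with a short Python sketch. It assumes that a single reference (no hyphens or commas) is split on runs of non-word characters and that each step is read as an Arabic numeral, a Roman numeral, or a single alphabetic letter (a = 1, b = 2, and so on). The function names and the `prefer` parameter are hypothetical, the Roman numeral reader does no strict validation, and TAN's own numeral handling supports more than this.

```python
import re

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_value(step):
    """Read `step` as a Roman numeral; None if it has non-Roman letters."""
    s = step.lower()
    if not s or any(ch not in ROMAN for ch in s):
        return None
    total = 0
    for ch, nxt in zip(s, s[1:] + " "):
        value = ROMAN[ch]
        # A smaller numeral before a larger one is subtracted (iv = 4).
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total


def alpha_value(step):
    """Read a single letter as a = 1, b = 2, ...; None otherwise."""
    s = step.lower()
    if len(s) == 1 and "a" <= s <= "z":
        return ord(s) - ord("a") + 1
    return None


def step_value(step, prefer="roman"):
    """Resolve one step, trying the preferred letter system first."""
    if step.isdecimal():
        return int(step)
    readers = [roman_value, alpha_value]
    if prefer != "roman":
        readers.reverse()
    for reader in readers:
        value = reader(step)
        if value is not None:
            return value
    raise ValueError(f"cannot read {step!r} as a numeral")


def ref_values(ref, prefer="roman"):
    """Split one reference on non-word characters and resolve each step."""
    return [step_value(s, prefer) for s in re.split(r"\W+", ref.strip()) if s]
```

Under these assumptions, `ref_values("I.1.1")`, `ref_values("A I i")`, and `ref_values("1:a:I")` all resolve to the same three steps, while `step_value("c")` depends on which system is preferred, which is the ambiguity that <numerals> is meant to settle.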
To point to a token one normally uses <tok>, with one or more attributes, in three possible configurations:
@val or @rgx alone: one or more tokens are pointed to by value. For example, val = "bird" points to every occurrence of the token bird; rgx = "b.+d" finds every word that begins with a b, ends with a d, and has some characters in between. Every value of @rgx is implicitly bound to the beginning and end of the string (see below).
@pos alone: one or more tokens are pointed to by numerical position, via one or more digits, or the phrase last or last- plus a digit, joined by hyphens or commas. For example, 2, 4-6, last-2 - last refers to the second, fourth, fifth, sixth, antepenult, penult, and final tokens in a passage. The numerical value to which the keyword last resolves depends upon the context length.
@val or @rgx combined with @pos: a combination of the previous two methods. For example, @val="bird" @pos="2, 4" picks the second and fourth occurrences of the token bird.
During Schematron validation, if @pos is missing, it is assumed to mean * or 1 - last; if neither @val nor @rgx appears, the assumption is @rgx with value .+ (any characters). That is, @pos by default points to every instance and @val/@rgx by default points to any string.
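The three configurations and their defaults can be sketched in Python. This is a simplified model, not TAN's validation logic: the function name select_tokens is hypothetical, tokens are given as a plain list, and positions are passed as an already resolved list of 1-based integers rather than as a @pos string.

```python
import re


def select_tokens(tokens, val=None, rgx=None, positions=None):
    """Return the 1-based positions of the tokens that are pointed to.

    val       -- exact token value (takes precedence in this sketch)
    rgx       -- regular expression that must match the whole token;
                 defaults to ".+" when neither val nor rgx is given
    positions -- 1-based picks among the matches; None means all
    """
    if val is not None:
        hits = [i for i, t in enumerate(tokens, 1) if t == val]
    else:
        pattern = rgx if rgx is not None else ".+"
        hits = [i for i, t in enumerate(tokens, 1) if re.fullmatch(pattern, t)]
    if positions is None:
        return hits
    return [hits[p - 1] for p in positions]


tokens = ["bird", "sees", "bird", "bird", "calls", "bird"]
by_value = select_tokens(tokens, val="bird")
second_and_fourth = select_tokens(tokens, val="bird", positions=[2, 4])
everything = select_tokens(tokens)
```

In this model, omitting every filter selects every token, which mirrors the defaults described above.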
When using @pos make sure you know the context. For example, the attribute combination val="bird" pos="last-1" will produce an error if the token bird does not occur at least two times in the given context.
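To show how a @pos value might resolve against a context of known length, here is a hedged Python sketch. It assumes that items are separated by commas, that a range is written with a hyphen between two terms, and that a term is either digits, last, or last- followed by digits with no internal spaces; the names resolve_term and resolve_pos are hypothetical and the real TAN parser may accept other spellings.

```python
import re

# A term is digits, "last", or "last-" plus digits (no internal spaces).
TERM = r"(?:last(?:-\d+)?|\d+)"
ITEM = re.compile(rf"^\s*({TERM})(?:\s*-\s*({TERM}))?\s*$")


def resolve_term(term, count):
    """Turn one term into a 1-based position within `count` matches."""
    if term == "last":
        value = count
    elif term.startswith("last-"):
        value = count - int(term[len("last-"):])
    else:
        value = int(term)
    if not 1 <= value <= count:
        raise ValueError(f"{term!r} is out of range for {count} match(es)")
    return value


def resolve_pos(pos, count):
    """Expand a @pos value into a list of 1-based positions."""
    positions = []
    for item in pos.split(","):
        match = ITEM.match(item)
        if match is None:
            raise ValueError(f"cannot parse {item!r}")
        start = resolve_term(match.group(1), count)
        end = resolve_term(match.group(2), count) if match.group(2) else start
        positions.extend(range(start, end + 1))
    return positions


positions = resolve_pos("2, 4-6, last-2 - last", 10)
```

With ten tokens in the context, the example from the text expands to the second, fourth through sixth, and last three positions; with only one match available, `resolve_pos("last-1", 1)` raises an error, which parallels the validation error described above.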
It is advisable to use @val, and not merely @pos. If your source's text changes, and there is no @val, it may be difficult to determine the original intent of a claim, and therefore whether changes need to be made. Furthermore, @val is generally more efficient to process than @rgx. A @rgx is more efficient to process only if it replaces numerous instances of @val.
@rgx is a regular expression that must match an entire word-token. For example, @rgx="re.d" will match the tokens "rend" and "read" but will not match "already", "rends", or "bread". If you wish to allow for characters at the beginning or end, use ".*re.d.*". For more on regular expressions, see the section called “Regular Expressions”.
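The whole-token behavior can be imitated in Python with re.fullmatch, which succeeds only when the pattern covers the entire string. This is an analogy for illustration: Python's regular expression dialect is not the one TAN uses, although the simple patterns below behave the same way, and the helper name rgx_matches is hypothetical.

```python
import re


def rgx_matches(rgx, token):
    """True if the pattern matches the whole token, not just part of it."""
    return re.fullmatch(rgx, token) is not None


words = ["rend", "read", "already", "rends", "bread"]
strict = [w for w in words if rgx_matches("re.d", w)]
loose = [w for w in words if rgx_matches(".*re.d.*", w)]
```

Here `strict` keeps only the two four-letter tokens, while `loose` keeps all five.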
@chars
Individual letters are always specified by @chars, which points to one or more specific positions, e.g., chars="2, 7, last". Combining characters are excluded from these counts; see the section called “Combining characters”.
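A small Python sketch shows how character positions might be counted when combining characters are grouped with the preceding base character. It assumes that "combining character" means any character in a Unicode mark category (Mn, Mc, or Me), which may not match TAN's exact definition; the function name char_clusters is hypothetical.

```python
import unicodedata


def char_clusters(token):
    """Group each base character with the marks that follow it."""
    clusters = []
    for ch in token:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return clusters


# "été" written with combining acute accents: five code points,
# but only three countable character positions.
decomposed = "e\u0301te\u0301"
clusters = char_clusters(decomposed)
```

In this model, position 2 of the sample token is the letter t, and the last position is the final e together with its accent.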