This chapter provides general background to the elements and attributes that are common to class 2 TAN files. For detailed discussion see Chapter 8, TAN patterns, elements, and attributes defined.
At present, class 2 files are restricted to alignment or lexico-morphology.
Alignment files come in two different formats, identified by the root element. TAN-A-div provides macroscopic alignment; TAN-A-tok, microscopic. TAN-A-div aligns one or more class 1 files. It is intended for broad, general alignments of any number of versions of any number of works. The scope of TAN-A-tok is more restricted, to two class 1 files, allowing one to declare alignments with detailed specificity, certainty, and type between words (tokens). TAN-A-div focuses on works, regardless of version; TAN-A-tok focuses on individual versions.
Lexico-morphology files (also called part-of-speech files), TAN-LM, are used to encode the lexical headwords and morphological forms of individual words in class 1 files.
The class 2 formats have been designed to be human readable, particularly in their references to class 1 files. In ordinary conversation, when referring to specific parts of a work, we like to cite pages, paragraphs, sentences, lines, words, letters, and so forth. We use relational words (e.g., "first") and the very text itself. We might say, for example, "See page 4, second paragraph, the last four words." Or, "See page 4, second paragraph, first sentence, second occurrence of 'pull'."
The TAN pointer syntax differs from other pointer systems (e.g., URLs, XPath, and XPointer) in that it depends upon a hierarchy of four features: works, divisions, word tokens, and characters. Works, discussed above (see the section called “One work”), are determined by the source (which may not contain more than one work). Divisions are defined by the <div> structure of each source. Tokens are the words of those divisions, defined according to one or more tokenization rules. And characters are defined as non-modifying codepoints in a word token. (A modifying character is treated as of a piece with the non-modifying base character it modifies.)
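The character-counting rule can be sketched in a few lines of Python. This is only an illustration, not TAN's own implementation: the function name `base_characters` is hypothetical, and it assumes that a "modifying" character is any codepoint with a nonzero Unicode canonical combining class, which may be narrower or broader than the definition TAN actually applies.

```python
import unicodedata


def base_characters(token: str) -> list:
    """Group each non-combining codepoint with the combining marks after it.

    Assumption: "modifying" means a nonzero canonical combining class.
    """
    groups = []
    for ch in token:
        if unicodedata.combining(ch) and groups:
            groups[-1] += ch  # attach the mark to its base character
        else:
            groups.append(ch)  # a new countable character
    return groups


# "e" followed by U+0301 COMBINING ACUTE ACCENT counts as one character.
token = "caf" + "e\u0301"
chars = base_characters(token)
```

Under this assumption the token above has five codepoints but only four countable characters, so a character number of 4 would pick out the accented letter as a whole.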
Parts of this fourfold hierarchy (works, divisions, tokens, and characters) are named with the vocabulary that the editor of a class 2 file finds most useful. Sources are given a nickname (e.g., xml:id = "hamlet-1741"); divisions are named using the values for @n; tokens are referred to by position, by their actual values, or both (e.g., pos = "1 - 5", pos = "last-1 - last", val = "hath"; see the section called “@pos and @val”). Characters are always identified by number (e.g., chars = "2, 7").
This approach not only makes the syntax human readable, it also mitigates any disruption that corrections or alterations might cause. For example, if an incorrectly duplicated <div> is deleted, disruption to the reference system is isolated and does not affect the rest of the document.
Some class 2 files may be time-consuming to validate fully. The <body> could be enormous. Or the number and length of the sources may be taxing. Or validation may depend upon time-consuming transformations of the source documents. This problem most often affects TAN-A-div files, so to facilitate editing within an XML editor, where regular validation is essential, Schematron validation falls into one of two phases:
basic: All regular Schematron tests are suspended, and reports are devoted exclusively to assisting in looking for and checking the validity of references in <div-ref> and <tok>.
verbose: complete testing of class-2 files, including checks on source files to determine whether they adhere to the LDUR (see the section called “Flattened References, and the Leaf Div Uniqueness Rule”). In addition, information is given on where there are discrepancies in the numeration system across versions of the same work.
If you do not specify in the prolog which phase you intend to be the default, you will be prompted for the phase you wish to use whenever you validate the file.
Metadata (<head>)
Class 2 files share a few common features in their metadata, mostly to facilitate the human-friendly reference system outlined above.
All class 2 files have as their sources nothing other than class 1 files. Therefore each <source> must take the digital entity metadata pattern (see the section called “Digital Entity Metadata Pattern”). Because the rights have already been declared in the source files, <rights-source-only> is disallowed.
Editors of class 2 files must be able to name or number word tokens in a transcription, via an optional <token-definition>. See the section called “Defining Words and Tokens”.
There may be some cases where a source has a div type that is unnecessary, is confusing, or should be ignored. One or more optional <suppress-div-types> elements may be used to specify division types that you wish to suppress in references.
Optional <rename-div-ns> elements provide a convenient way to provisionally rename @n values. This is useful for cases where you wish to use division labels that are more familiar to users of the class 2 files, or that are easier to edit and read. It can also be used to harmonize discordant @n values, which is especially helpful for divs that are named, not numbered, such as the books of the Bible.
<body>
The three types of class 2 files treat different kinds of phenomena, so their data structures look quite different. Nevertheless, a few elements and attributes are shared by at least two class 2 formats.
Many class 2 elements take @src and @ref. @src points via ID reference to one or more <source>s, and @ref points to one or more <div>s through their flat ref (perhaps substituted with their new values if <rename-div-ns> has been invoked; see the section called “Metadata (<head>)”).
In the example ref = "1.2-4, 1.5", the periods are arbitrary (but the hyphen and comma, which have special meanings here, are not). You may use any punctuation you wish, or even spaces, but it is recommended that you use what will be most familiar to users. You may use non-Arabic numerals, regardless of the numbering system used by your sources.
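To illustrate why the choice of separator does not matter, the following Python sketch splits a @ref-style value into comma-separated items and hyphen-separated range ends, then breaks each end into its components on any run of punctuation or space. The function name `split_ref` is hypothetical, and the sketch deliberately does not decide how a shortened range end such as the 4 in "1.2-4" should be completed; that interpretation belongs to the TAN validation routines and is not assumed here.

```python
import re
from typing import List, Optional, Tuple

# One item of a ref: the components of its start and, for a range, of its end.
RefItem = Tuple[List[str], Optional[List[str]]]


def split_ref(ref: str) -> List[RefItem]:
    """Split a @ref-style string into items; commas and hyphens are special."""
    items = []
    for item in ref.split(","):
        ends = item.split("-", 1)  # at most one hyphen: start and end
        # Any other punctuation or space merely separates components.
        start = re.findall(r"\w+", ends[0])
        end = re.findall(r"\w+", ends[1]) if len(ends) == 2 else None
        items.append((start, end))
    return items


with_periods = split_ref("1.2-4, 1.5")
with_spaces = split_ref("1 2-4, 1:5")
```

Both calls yield the same structure, which is the sense in which the periods in the example are arbitrary while the hyphen and comma are not.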
@chars and @pos follow a useful compact syntax, described below (see the section called “@pos and @val”).
To point to a token, one of three methods may be used.
@pos alone. Under this method, one or more digits, or the keyword last, or last- plus a digit, joined by hyphens or commas, indicate one or more token numbers. For example, 2, 4-6, last-2 - last refers to the second, fourth, fifth, sixth, antepenultimate, penultimate, and final tokens in a sequence of word tokens. The numerical value to which the keyword last resolves depends upon the context of each source and ref.
@val alone. Under this method, a single token is picked by means of a string value equivalent to the token. For example, @val = "bird" points to the first occurrence of the token bird.
@pos and @val together. Under this method, specific occurrences of a token are picked. For example, @val="bird" @pos="2, 4" picks the second and fourth occurrences of the token bird.
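The position syntax can be made concrete with a small Python sketch that expands such an expression into token numbers once the length of the token sequence is known. It is an illustration under stated assumptions rather than the routine TAN uses: the name `resolve_pos`, the regular expressions, and the choice to reject descending ranges are all hypothetical.

```python
import re

# A term is a number, "last", or "last-N"; an item is a term or a range of terms.
TERM = r"(?:last(?:-\d+)?|\d+)"
ITEM = re.compile(rf"\s*({TERM})\s*(?:-\s*({TERM})\s*)?")


def resolve_pos(pos: str, count: int) -> list:
    """Expand a @pos-style expression into 1-based positions among `count` tokens."""

    def value(term: str) -> int:
        if term.startswith("last"):
            n = count - int(term[5:] or 0)  # "last" or "last-N"
        else:
            n = int(term)
        if not 1 <= n <= count:
            raise ValueError(f"{term!r} resolves to {n}, outside 1..{count}")
        return n

    positions = []
    for item in pos.split(","):
        m = ITEM.fullmatch(item)
        if m is None:
            raise ValueError(f"cannot parse {item!r}")
        start = value(m.group(1))
        end = value(m.group(2)) if m.group(2) else start
        if end < start:
            raise ValueError(f"descending range in {item!r}")  # assumption
        positions.extend(range(start, end + 1))
    return positions


example = resolve_pos("2, 4-6, last-2 - last", 10)
```

With ten tokens, last resolves to 10, so the example expression selects tokens 2, 4, 5, 6, 8, 9, and 10. The same expression applied to a shorter or longer sequence would select different final tokens, which is why last depends on the context of each source and ref.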
Any time @pos appears in an element and @val does not, @val is assumed to allow matches to any word. Vice versa, if @val appears but @pos does not, the latter is assumed to equal 1.
@pos and @val must be used carefully. For example, the attribute combination val="bird" pos="last-5" will produce an error if the word token bird does not occur at least six times.
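The interaction of the two attributes, including the defaults and the error case just described, can be sketched as follows. This is a hypothetical helper (`pick_tokens`), not part of TAN: it assumes that @val matches by exact string equality, it handles only comma-separated single positions (ranges are left out for brevity), and it reports an out-of-range position with a Python ValueError.

```python
from typing import Optional


def pick_tokens(tokens: list, val: Optional[str] = None, pos: str = "1") -> list:
    """Return 1-based indexes into `tokens` chosen by a val/pos pair.

    No val: every token is a candidate. No pos: the first candidate is taken.
    """
    candidates = [i for i, t in enumerate(tokens, start=1)
                  if val is None or t == val]  # assumption: exact match
    picked = []
    for term in pos.split(","):
        term = term.strip()
        if term.startswith("last"):
            n = len(candidates) - int(term[5:] or 0)  # "last" or "last-N"
        else:
            n = int(term)
        if not 1 <= n <= len(candidates):
            raise ValueError(
                f"pos {term!r} needs more than the {len(candidates)} candidates found")
        picked.append(candidates[n - 1])
    return picked


tokens = "a bird saw a bird and a bird flew".split()
first_bird = pick_tokens(tokens, val="bird")
later_birds = pick_tokens(tokens, val="bird", pos="2, last")
final_token = pick_tokens(tokens, pos="last")
```

In this sample sentence bird occurs three times, so val="bird" with pos="last-5" raises an error, because that combination requires at least six occurrences.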