<head>)
No matter how much one TAN format differs from another, the metadata are quite similar. Anyone getting a TAN file, no matter its class or type, is assumed to want to know, and therefore find easily and predictably, the following:
the stable name of the file;
its version;
its sources;
other files upon which it depends or with which it otherwise has an important relationship;
the most significant parts of the editorial history;
the linguistic or scholarly conventions that have been adopted in creating and editing the data;
the license, i.e., who holds what rights to the data, and what kind of reuse is allowed;
the persons, organizations, or entities that helped create the data, and the roles played by each.
To answer these questions completely, consistently, and predictably, the <head>, a mandatory child of the root element, follows a common pattern across all TAN formats, allowing anyone to work easily and predictably across large numbers and types of TAN files. The TAN <head>, intended to be concise and focused, compels you to provide metadata for the data that is governed by <body>, but it does not accommodate metadata for the metadata. That is, your metadata should focus on the data itself and not on other things. For example, <head> requires you to name the people who helped create or edit the data, but you are not expected to tell us about them. Merely give good <IRI>s that point to authoritative sources that provide background information.
Note: The principles above explain why the TEI extension of TAN requires two heads, one for TEI and the other for TAN.
Detailed descriptions of <head> and its components are in Chapter 8, TAN patterns, elements, and attributes defined. Here we provide a general summary of TAN metadata.
To describe the current file, <head> takes one or more <name>s, zero or more <desc>s and <master-location>s, and one <license>.
Next comes a list of files upon which the file depends: zero or more <inclusion>s, zero or more <key>s, zero or more <source>s, and zero or more <see-also>s.
All editorial assumptions are placed in <definitions>, whose contents differ from one TAN format to the next.
Finally comes the responsibility section stating who did what when: one or more <person>s, <role>s, and <change>s, and zero or more <resp>s.
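The cardinalities just summarized can be checked mechanically. The following is a minimal sketch, not a TAN validator: it assumes a toy, namespace-free <head>, the function name check_head is hypothetical, and <definitions> is left out because the summary above does not state how many are allowed.

```python
import xml.etree.ElementTree as ET

# (min, max) occurrences of each child of <head>, transcribed from the
# summary above. None means "no upper bound". <definitions> is omitted
# because the summary does not state its cardinality.
CARDINALITY = {
    "name": (1, None),
    "desc": (0, None),
    "master-location": (0, None),
    "license": (1, 1),
    "inclusion": (0, None),
    "key": (0, None),
    "source": (0, None),
    "see-also": (0, None),
    "person": (1, None),
    "role": (1, None),
    "change": (1, None),
    "resp": (0, None),
}


def check_head(head):
    """Return a list of cardinality problems found among <head>'s children."""
    problems = []
    for tag, (low, high) in CARDINALITY.items():
        count = len(head.findall(tag))
        if count < low:
            problems.append(f"<{tag}>: expected at least {low}, found {count}")
        elif high is not None and count > high:
            problems.append(f"<{tag}>: expected at most {high}, found {count}")
    return problems


# Toy <head> (an assumption for illustration): it lacks <license> and <change>.
sample = ET.fromstring("<head><name>Example</name><person/><role/></head>")
problems = check_head(sample)
for problem in problems:
    print(problem)
```

A real TAN file places its elements in a namespace and is validated against the official schemas; this sketch only makes the counts in the summary concrete.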
Two TAN elements cover rights and licenses: <license> (mandatory in every TAN file) and <licensor>. The first element defines the license under which you are releasing your data; the second specifies who has licensed the data.
The license applies only to the file itself, not to its sources. The distinction is important, and helpful. It is much easier for you to decide and state the rights and license behind your own work than to do so for that of others. Declaring who holds what rights over your source(s) may be not only difficult but risky, and is therefore optional (see below).
When using a TAN file, you should investigate the entire chain of rights. If you find a discrepancy between the license of a TAN file and that of its sources you should respect the more restrictive one. If a TAN file has a very liberal, open license for the data, this does not necessarily mean that the material upon which it depends is in the public domain. The TAN file's source may be under tight restrictions.
If you wish to indicate what license governs a source, use <desc> in <source>.
TAN adopts the Creative Commons licenses as its default key vocabulary. See the section called “TAN keywords for types of rights (<license>)”.
Many if not most TAN files are created alongside or in the context of a project, where certain elements will be repeated. Explicit repetition from one file to the next makes them prone to error. Changes might be made in one file but not in another. TAN has two features—keys and inclusions—that help avoid duplication, reduce the likelihood of incomplete editing, and lead to cleaner, smaller files.
In general, you should first work with keys. If they are not doing the job you need, then try inclusions.
Most often, an editor wants a simple, shorthand reference to an entity commonly referred to from one file to the next in a single project, e.g., the person who is the principal editor, roles, and division types.
Projects are advised to create their own <TAN-key> files populated with commonly used vocabulary.
Using those files is a two-step process. First, the TAN-key file is declared via <key>. Second, elements (normally in <definitions>) can take @which instead of the customary IRI + name pattern. @which points to a <name> in the TAN-key file.
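The two-step lookup can be pictured as a dictionary search. The sketch below is an illustration under stated assumptions, not TAN's actual resolution algorithm: the vocabulary, the tag: IRIs, and the function name resolve_which are all hypothetical, and a TAN-key file is reduced to a mapping from a name to the IRI it abbreviates.

```python
# Hypothetical stand-in for the contents of a project's TAN-key file:
# each entry pairs a name with the IRI it abbreviates.
PROJECT_KEY = {
    "principal editor": "tag:example.org,2018:role:principal-editor",
    "chapter": "tag:example.org,2018:div-type:chapter",
}


def resolve_which(which, declared_keys):
    """Return the (iri, name) pair that a @which value stands for.

    Declared keys are searched in order; a value found in none of them
    is an error, because the shorthand cannot be expanded.
    """
    for key in declared_keys:
        if which in key:
            return key[which], which
    raise KeyError(f"@which='{which}' matches no name in any declared key")


# Step 1: the file declares the key (the role <key> plays in <head>).
declared = [PROJECT_KEY]

# Step 2: an element gives @which instead of spelling out IRI + name.
iri, name = resolve_which("chapter", declared)
print(iri, name)
```

The point of the design is that a correction made once in the key file reaches every file that declares it.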
TAN includes a number of standard TAN-key files located at http://textalign.net/release/TAN-2018/TAN-key/ and documented in Chapter 9, Official TAN keywords. Any element that takes @which can take full advantage of those files, without <key>.
It is strongly recommended that you depend upon only TAN-key files you have written, and not those of a different project.
More powerful than TAN-keys are inclusions. Unlike other forms of inclusion you may be familiar with, TAN inclusion involves only select elements, never an entire file. As with keys, TAN inclusion is a two-step process.
First, a TAN file is made available for inclusion via <inclusion>s (inside <head>). Like <key>, an <inclusion> does nothing on its own. It merely indicates a file that may be used for inclusions.
Second, elements that allow it may take @include, which points to the @xml:id reference of the <inclusion>. In the validation process, those elements will be replaced with every element of that name found in the inclusion file, checked recursively (see below), and ignoring duplicated elements.
<inclusion>s are critically important to the content of the TAN file, so any file with <inclusion>s that cannot be located will be regarded as being in fatal error. Because of the importance of access to included files, it is strongly recommended that inclusions be limited to files locally available, in the same project.
Inclusions are recursive. If a TAN file A has <x include='B'> and file B has <x include='C D E'> then file A will be given all <x>s found in B, C, D, and E.
In any recursive activity, circularity is fatal. That is true for TAN inclusion as well, but only within a given element name. It is perfectly legal for two files to include each other, as long as they do not try to include the same elements.
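The recursion and the per-element circularity rule can be sketched as follows. This is a toy model, not the validation routines themselves: files are reduced to a dictionary, the file names, element names, and values are hypothetical, and duplicates are detected by simple equality.

```python
# Toy model: FILES[file][element_name] = (own_elements, included_files).
FILES = {
    "A": {"x": ([], ["B"]), "y": (["y1"], [])},
    "B": {"x": (["x1"], ["C", "D", "E"]), "y": ([], ["A"])},
    "C": {"x": (["x2"], [])},
    "D": {"x": (["x2", "x3"], [])},  # "x2" duplicates an element in C
    "E": {"x": (["x4"], [])},
    "P": {"z": ([], ["Q"])},  # P and Q include the same element
    "Q": {"z": ([], ["P"])},  # from each other: a fatal cycle
}


def resolve(file, element, active=()):
    """Collect every `element` reachable from `file`, following inclusions.

    `active` holds the files currently being resolved for this element
    name, so a cycle is detected per element, not per file.
    """
    if file in active:
        chain = " -> ".join(active + (file,))
        raise ValueError(f"circular inclusion of <{element}>: {chain}")
    own, included = FILES[file].get(element, ([], []))
    result = list(own)
    for target in included:
        for item in resolve(target, element, active + (file,)):
            if item not in result:  # ignore duplicated elements
                result.append(item)
    return result


print(resolve("A", "x"))  # A receives the <x>s of B, C, D, and E
print(resolve("B", "y"))  # legal: A and B include different elements
```

Files A and B include each other without error because A asks B only for x and B asks A only for y; P and Q fail because both ask for z.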
TAN inclusion removes elements from their original context, which means that values that must be interpreted locally are converted before the elements are included. For example, @which must be interpreted in light of the included document's keys, not those of the including document. Similarly, different numeration systems, e.g., Roman numerals, must be interpreted locally and converted before inclusion (see the section called “One reference system”).
<source>s and <see-also>s
Creating and editing a class 1 TAN file frequently involves working with non-TAN digital files. In the course of editing, and making the material TAN-compatible, you will likely start to correct errors, to normalize conventions, or to bring the transcription closer to an earlier version. At such times it may be unclear how to credit the digital files.
To answer this, first determine a class 1 file's <source>. Everything else is then a <see-also>.
If you find in the course of editing that you are starting to depend upon the source of your source, then that earlier version should be credited as the <source> and the file you were using should be moved to <see-also>.
Many attributes are not inheritable, e.g., @xml:id. Others are inheritable, indicating something about the host element and all its descendants. When a descendant has the same attribute, the default behavior is for the new attribute to cancel any inherited ones, e.g., @xml:lang, @affects-element, @claimant. In other cases, the inherited effect is additive, e.g., @cert. Consult individual attribute entries to understand an attribute's behavior.
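The default, overriding kind of inheritance can be sketched with @xml:lang. This is a minimal illustration on a toy document: the function name effective_values, the element names, and the language codes are assumptions, and additive attributes such as @cert are deliberately not modeled because their combination rule is defined in the individual attribute entries.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_values(root, attr):
    """Map each element to the value of `attr` it carries or inherits.

    Models only the overriding behavior: a descendant's own value
    cancels whatever it would otherwise inherit.
    """
    resolved = {}

    def walk(element, inherited):
        value = element.get(attr, inherited)
        resolved[element] = value
        for child in element:
            walk(child, value)

    walk(root, None)
    return resolved


# Toy document (an assumption for illustration).
doc = ET.fromstring(
    '<body xml:lang="grc"><div><div xml:lang="lat"/><div/></div></body>'
)
langs = effective_values(doc, XML_LANG)
outer = doc[0]
print(langs[outer], langs[outer[0]], langs[outer[1]])
```

The outer <div> and its second child inherit the value from <body>, while the first child's own @xml:lang cancels the inherited one.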
Some attributes in an element have priority for interpretation. @claimant, for example, has priority over @cert. That is, the two attributes in the same element are to be interpreted to mean: "@claimant has @cert confidence about the following claim:...."
At the heart of interaction between class 1 and class 2 files is a reference system that counts or names words. This poses a problem at the outset. The term word is notoriously difficult to define, no matter the language. In different contexts, for example, "New York" and "didn't" can each be justifiably taken to be one or two words. Furthermore, some scholars consider punctuation to be words (e.g., commas in modern prose, representing "and"), whereas others ignore them as being anachronistic or capricious (e.g., ancient Greek and Latin). In the end, the number of meanings for "word" reflects the rich variety of scholarly disciplines.
TAN adopts the proximate term token—a word that is defined not according to grammar but according to a regular expression (see the section called “Regular Expressions”).
A TAN token is a reference pointer, not a linguistic marker. To define a token in TAN does not entail any linguistic commitments. Neither editors nor users of TAN data should infer that a <tok> points to a morpheme, a lexeme, or any other linguistic entity. There will frequently be a fortuitous correlation between the two, but it is not guaranteed. In TAN, a token is purely a method of reference.
TAN was developed in service of ancient literature, where punctuation is generally ignored as being late or not central to the text. Even in contemporary use, most people ignore punctuation when they count words. Therefore the default <token-definition> defines a token as being any continuous string of word characters, the soft hyphen, the zero-width space, or the zero-width joiner, formally defined:
<token-definition regex="[\w&#xad;&#x200b;&#x200d;]+"/>
This pattern will result in a close resemblance to what is ordinarily thought of as words, but perhaps with some surprises (see above, the section called “Regular Expressions”). If no <token-definition> is explicitly given, the pattern above will be assumed.
If you are working with modern texts, where punctuation might be important to name and number, try the built-in keyword letters and punctuation:
<token-definition regex="\w+|[^\w\s]"/>
This expression defines a token as a sequence of word characters or any single character that is neither a word character nor whitespace. The string "(I go!)" (the text inside the quotation marks) would have five tokens: ( I go ! ).
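The difference between the two built-in definitions is easy to see by running both over the same string. The sketch below uses Python's re module as a stand-in for the regular-expression dialect used by TAN; the two dialects differ at the margins (for example, in exactly which characters \w covers), so treat the results as an approximation. The names DEFAULT, LETTERS_AND_PUNCTUATION, and tokenize are hypothetical.

```python
import re

# Default definition: word characters plus the soft hyphen (U+00AD),
# zero-width space (U+200B), and zero-width joiner (U+200D).
DEFAULT = re.compile(r"[\w\u00ad\u200b\u200d]+")

# "Letters and punctuation": a run of word characters, or any single
# character that is neither a word character nor whitespace.
LETTERS_AND_PUNCTUATION = re.compile(r"\w+|[^\w\s]")


def tokenize(text, definition):
    """Return the tokens of `text` under the given token definition."""
    return definition.findall(text)


words_only = tokenize("(I go!)", DEFAULT)
with_punctuation = tokenize("(I go!)", LETTERS_AND_PUNCTUATION)
contraction = tokenize("didn't", DEFAULT)

print(words_only)        # ['I', 'go']
print(with_punctuation)  # ['(', 'I', 'go', '!', ')']
print(contraction)       # ['didn', 't']
```

The last line shows one of the surprises: under the default definition the apostrophe is not part of any token, so "didn't" yields two tokens.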
Above are two built-in, TAN-defined <token-definition>s. You may customize your own <token-definition> to suit your needs. But keep in mind that TAN files were meant to be shared across fields and disciplines. You are encouraged to define tokens in a manner customary to users of the text. Specialized definitions make it less likely that your TAN file will be able to mesh well with other TAN files. Two class-2 files annotating the same class-1 file cannot be easily compared or synthesized if they use different definitions of token.
Given those caveats, consider a specialized case, where you wish to prepare your transcriptions such that certain Unicode characters precisely delimit tokens that are synonymous with a particular linguistic category, say lexeme. Suppose, for example, that you use format characters (e.g., U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER) to mark word boundaries within the text of your class 1 file. You might then create a <token-definition> like this:
<token-definition regex="[^\p{Cf}\s]+"/>
The statement defines a token as any consecutive sequence of characters that are neither Unicode format characters (category Cf) nor whitespace.
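The effect of such a definition can be approximated in Python. The standard re module does not support \p{Cf}, so this sketch swaps in unicodedata.category to recognize format characters, and it uses str.isspace, which is somewhat broader than \s in XML regular expressions. The function names and the sample string are hypothetical.

```python
import unicodedata
from itertools import groupby


def is_delimiter(char):
    """True for whitespace and for Unicode format (Cf) characters."""
    return char.isspace() or unicodedata.category(char) == "Cf"


def tokenize(text):
    """Split text into maximal runs of non-delimiter characters."""
    return [
        "".join(run)
        for delimiter, run in groupby(text, key=is_delimiter)
        if not delimiter
    ]


# U+200C and U+200D mark token boundaries inside what looks like one word.
tokens = tokenize("in\u200cto the un\u200dknown")
print(tokens)  # ['in', 'to', 'the', 'un', 'known']
```

Because the boundary characters are invisible, the printed tokens are the only sign that "into" and "unknown" were each split in two.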
Such customized approaches may make the technique unwieldy or impossible to use, thereby limiting your TAN file's interoperability and utility. If you use format characters or other special characters that are invisible, it is recommended that you write them as XML character references, e.g., &#x200d;, so they can be seen in your file.