<head>
No matter how much one TAN format differs from another, the metadata are quite similar. Anyone getting a TAN file, no matter its class or type, is assumed to want to know, and therefore find easily and predictably, the following:
the stable name of the file;
its version;
its sources;
other files upon which it depends or with which it otherwise has an important relationship;
the most significant parts of the editorial history;
the linguistic or scholarly conventions that have been adopted in creating and editing the data;
the license, i.e., who holds what rights to the data, and what kind of reuse is allowed;
the persons, organizations, or entities that helped create the data, and the roles played by each.
To answer these questions completely, consistently, and predictably, the <head>, a mandatory child of the root element, follows a common pattern across all TAN formats, thus allowing anyone to work easily and predictably across large numbers and types of TAN files. The TAN <head>, intended to be concise and focused, compels you to provide metadata for the data that is governed by <body>, but it does not accommodate metadata for the metadata. That is, your metadata should focus on the data itself and not other things. For example, <head> requires you to name the people who helped create or edit the data, but you are not expected to tell us about them. You merely refer through <IRI> to other authoritative sources that can provide background information.
Note | |
---|---|
The principles above explain why the TEI extension of TAN requires two
heads, one for TEI and the other for TAN. Because of its design principles, the
|
Detailed descriptions of <head> and its components are in Chapter 8, TAN patterns, elements, and attributes defined. Here we provide a summary, general description of TAN metadata.
To describe the current file, <head> takes one or more <name>s, zero or more <desc>s and <master-location>s, and one <rights-excluding-sources>.
Next comes a list of files upon which the file depends: zero or more <inclusion>s, zero or more <key>s, zero or more <source>s, and zero or more <see-also>s.
All editorial assumptions are placed in <declarations>, whose contents differ from one TAN format to the next.
Finally comes the responsibility section stating who did what when: one or more <agent>s, <role>s, and <change>s, and zero or more <agentrole>s.
Two TAN elements cover rights and licenses: <rights-excluding-sources> (mandatory in every TAN file) and <rights-source-only> (optional, and never allowed in class 2 files, because a statement on rights is required in each source). The first element covers the work specific to a given TAN file. The second pertains to the rights for the sources. The distinction is important, and helpful. It is much easier for you to decide and state the rights and license behind your own work than to do so for that of others. Declaring who holds what rights over your source(s) may be not only difficult but risky, and is therefore optional (see below).
As an editor, you are strongly encouraged in the <desc> element of <rights-excluding-sources> to emphasize the distinction between the rights you have over your data and the rights held by others over your source, for the benefit of those who may not be familiar with the TAN format. A statement something like this is recommended: <desc>The data in this file, only insofar as it constitutes an independent work, is licensed exclusive of any licenses held by parties over the source or sources listed below.</desc>
When using a TAN file, you should investigate the entire chain of rights. If you find a discrepancy between the two licenses—that of a TAN file and that of its sources—you should respect the more restrictive license. If a TAN file has a very liberal, open license for the data, this does not necessarily mean that the material upon which it depends is in the public domain. The TAN file's source may be under tight restrictions.
It is recommended that you not declare who owns what rights over your source unless you are quite certain. Copyright laws differ from one country to another, and they change. A source may be protected by copyright in one place and simultaneously be in the public domain in another. (At the time of this writing, dozens of scholarly editions of ancient texts are in the public domain in Germany, where copyright of a new edition lasts forty years, but not in the U.S. or Canada, where there is no explicit legislation on this issue.) Some copyright statements in books are false, or cannot be proven. Some persons or entities who claim rights over a source may have no legal basis for the claim, at least in some jurisdictions. Furthermore, if you mischaracterize the rights that are held over a source, you may be held liable by a putative rights holder. It is safer to use the <IRI> of <source> (described below) to point the user to a publisher or some other entity that has greater authority and specificity about who owns what rights.
TAN adopts the Creative Commons licenses as its default key vocabulary. See the section called “TAN keywords for types of rights (<rights-excluding-sources><rights-source-only>)”.
Copyright Law versus Contract Law | |
---|---|
Some third-party services, such as the Thesaurus Linguae Graecae for
Greek texts, require users to agree not to copy and reuse the texts in
the service's databases. Such agreements fall under the area of contract law and
not copyright law. That is, many of these third parties have no intellectual
property rights (or only derivative rights) over the texts they store.
Therefore, they should normally not be credited in any |
Many if not most TAN files are created alongside or in the context of a project, where certain elements will be repeated. Such repetition makes the files prone to errors, where editorial corrections made in one place are mistakenly not made everywhere. TAN has two features that help avoid duplication, reduce the likelihood of incomplete editing, and lead to cleaner, smaller files.
Most often, an editor wants a simple, shorthand reference to an entity commonly referred to from one file to the next in a single project, e.g., the person who is the principal editor. Writing individual IRI + name patterns can be time-consuming, and if a change needs to be made, it is easy to be inconsistent or incomplete.
Vocabulary commonly used in a project may be kept in a <TAN-key> file. This file is made accessible to any other TAN file via <key>. The key vocabulary is then invoked by using @which, whose value should match a <name> value in the TAN-key file.
A number of standard keys have already been predefined, documented in Chapter 9, Official TAN keywords. It is strongly recommended that you not depend upon the supplementary TAN-key files of a different project. Rather you should develop your own. You may also wish to create a workflow where the TAN-key is used for private editing, but the published versions have their keywords resolved to their full value.
More powerful than TAN-keys are inclusions. Unlike other forms of inclusion you may be familiar with, TAN inclusion involves only select elements, never an entire file.
As with keys, TAN inclusion is a two-step process. First, a TAN file is made available for inclusion by invoking <inclusion>s (inside <head>). Like <key>, an <inclusion> does nothing on its own. It merely indicates a file that may be used for patterned inclusions.
Inclusions are acted upon only in the second step. Many elements allow @include, which points to the @xml:id reference of an included file. In the validation process, those elements will be replaced with every element of that name found in the inclusion file, checked recursively (see below), and ignoring duplicated elements.
<inclusion>s are critically important to the content of the TAN file, so any file with <inclusion>s that cannot be located will be regarded as being in fatal error. Because of the importance of access to included files, it is strongly recommended that inclusions be limited to files locally available, in the same project.
Inclusions are recursive. If a TAN file A has <x include='B'> and file B has <x include='C D E'>, then the validator for file A will replace the element with all <x>s found in B, C, D, and E.
In any recursive activity, circularity is fatal. That is true for TAN inclusion as well, but only within the domain of a given element name. It is perfectly legal for two files to include each other, as long as they do not try to include elements of the same name.
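The recursion and the per-element-name circularity rule can be illustrated with a minimal Python sketch. This is not the TAN validator: the in-memory model (a dictionary of file ids, element names, and entries), the function `resolve`, and the two exception classes are all hypothetical names introduced for illustration, and duplicates are detected simply by comparing text values.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical model: file id -> element name -> list of entries.
# An entry is ("value", text) for ordinary content, or
# ("include", [file ids]) for an element carrying @include.
Entry = Tuple[str, Union[str, List[str]]]
Files = Dict[str, Dict[str, List[Entry]]]


class MissingInclusion(Exception):
    """An included file cannot be located (treated as fatal)."""


class CircularInclusion(Exception):
    """A file includes itself, directly or indirectly, for one element name."""


def resolve(files: Files, file_id: str, name: str,
            active: Tuple[str, ...] = ()) -> List[str]:
    """Return every `name` element reachable from `file_id`, duplicates dropped."""
    if file_id in active:
        raise CircularInclusion(f"{name}: {' -> '.join(active + (file_id,))}")
    if file_id not in files:
        raise MissingInclusion(file_id)
    found: List[str] = []
    for kind, payload in files[file_id].get(name, []):
        if kind == "include":
            # Recurse only within the domain of this element name.
            for target in payload:
                for value in resolve(files, target, name, active + (file_id,)):
                    if value not in found:
                        found.append(value)
        elif payload not in found:
            found.append(payload)
    return found


files: Files = {
    "A": {"x": [("include", ["B"])], "y": [("value", "y-from-A")]},
    "B": {"x": [("value", "x-from-B"), ("include", ["C", "D", "E"])],
          "y": [("include", ["A"])]},
    "C": {"x": [("value", "x-from-C")]},
    "D": {"x": [("value", "x-from-D"), ("value", "x-from-C")]},
    "E": {"x": []},
}

x_in_a = resolve(files, "A", "x")  # gathers <x> from B, C, D, and E
y_in_b = resolve(files, "B", "y")  # B includes A for <y>: legal, no cycle
```

In this sketch A and B include each other, but for different element names, so neither call raises `CircularInclusion`. Two files that include each other's `x` would.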
TAN inclusion removes elements from their original context, which means that values that must be interpreted locally are converted before the elements are included. For example, @which must be interpreted in light of the included document's keys, not those of the including document. Similarly, different numeration systems, e.g., Roman numerals, must be interpreted locally and converted, before inclusion (see the section called “One reference system”).
<source>s and <see-also>s
Creating and editing a class 1 TAN file frequently involves working with non-TAN digital files. In the course of editing, and making the material TAN-compatible, you will likely start to correct errors, to normalize conventions, or to bring the transcription closer to an earlier version. At such times it may be unclear how to credit the digital files.
To answer this, first determine a class 1 file's <source>s. Everything else is then a <see-also>.
If you find that you are changing the material to go back to the source of your source, then that earlier version should be the <source> and the file you were using should be credited under a <see-also>. But beware, lest using a particular source (such as the TLG) puts you in violation of contract law (see the section called “Rights and Licenses”).
Some attributes are inheritable attributes, in that they affect not only the host element but all descendants as well. Some inheritable attributes in co-occurrence fall into an interpretive sequence. That is, in any given element, some attributes must be interpreted before others.
@claimant falls first in the sequence, and @cert second.
Each attribute qualifies the data governed by the element it modifies. Put another way, the two attributes are to be interpreted to mean: "@claimant has @cert confidence about the following data:...."
Suppose you are encoding claims made by someone else, and you are not certain if you are faithfully representing their point of view. In those cases, your doubt should be registered in a @claimant and @cert that is a parent to the secondary claim you are representing.
If @claimant is missing, it is to be assumed that the assertion is being made by the key <agent> (see the section called “@id and a TAN file's IRI Name”).
If @cert is missing, it is to be assumed that the data is asserted with full confidence.
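The inheritance and default rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the element names `claims`, `group`, and `item`, the agent ids, the sample `cert` values, and the function `annotate` are hypothetical, and the sketch assumes that the nearest declaration of an attribute overrides one inherited from further up.

```python
import xml.etree.ElementTree as ET

FULL = "full confidence"  # label used here when no @cert applies


def annotate(elem, key_agent, claimant=None, cert=None):
    """Yield (tag, effective claimant, effective cert) for elem and descendants."""
    # @claimant is read first, then @cert; each falls back to the inherited value.
    claimant = elem.get("claimant", claimant)
    cert = elem.get("cert", cert)
    # Missing @claimant -> the key agent; missing @cert -> full confidence.
    yield elem.tag, claimant or key_agent, cert or FULL
    for child in elem:
        yield from annotate(child, key_agent, claimant, cert)


# Hypothetical fragment, not a real TAN file.
doc = ET.fromstring(
    '<claims>'
    '<group claimant="editor2" cert="0.5">'
    '<item/>'
    '<item cert="0.9"/>'
    '</group>'
    '<item/>'
    '</claims>'
)

rows = list(annotate(doc, key_agent="editor1"))
for tag, who, confidence in rows:
    print(f"{who} has {confidence} about <{tag}>")
```

Each printed row follows the reading given above: the claimant, then the confidence, then the data being qualified.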
At the heart of interaction between class 1 and class 2 files is a reference system that counts or names words. This poses a problem at the outset. The term word is notoriously difficult to define, no matter the language. In different contexts, for example, "New York" and "didn't" can each be justifiably defined as one or two words. Furthermore, some scholars consider punctuation to be words (e.g., commas in modern prose, representing "and"), whereas others ignore them as being anachronistic or capricious (e.g., ancient Greek and Latin). In the end, the number of meanings for "word" reflects the rich variety of scholarly disciplines.
TAN adopts the proximate term token—a word that is defined not linguistically but computationally, according to a regular expression (see the section called “Regular Expressions”).
A TAN token is a reference pointer, not a linguistic marker. To define a token in TAN does not entail any linguistic commitments. Neither editors nor users of TAN data should infer that a <tok> points to a morpheme, a lexeme, or any other linguistic entity. There will frequently be a fortuitous correlation between the two, but it is not guaranteed. In TAN, a token is purely a method of reference.
TAN requires all class 2 files that handle tokens to define them, either implicitly through TAN defaults, or explicitly by using <token-definition>.
TAN was developed in service of ancient literature, where punctuation is anomalous, or of little use. Furthermore, even in contemporary use, most people ignore punctuation when they count words. Therefore the default <token-definition> defines a token as being any continuous string of word characters, the soft hyphen, the zero-width space, or the zero-width joiner, formally defined:
<token-definition regex="[\w­​‍]+"/>
This pattern will result in a close resemblance to what is ordinarily thought of as words, but perhaps with some surprises (see above, the section called “Regular Expressions”). If no <token-definition> is invoked for a particular source, the pattern above will be assumed. It may also be explicitly called through @which (see the section called “TAN keywords for types of token definitions (<token-definition>)”).
If you are working with modern texts, where punctuation might be important to name and number, try the built-in keyword general (or letters and punctuation):
<token-definition regex="\w+|[^\w\s]"/>
This expression defines a token as a sequence of word characters or any single character that is neither a word character nor a space. The string "(I go!)" (the text inside the quotation marks) would have five tokens: ( I go ! ).
Above are the two built-in, TAN-defined <token-definition>s.
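The two built-in definitions can be tried out with a short Python sketch. Treat it as an approximation: Python's `re` module defines `\w` as letters, digits, and the underscore, which is not identical to the `\w` of XML Schema and XPath regular expressions, so results may differ at the margins. The names `DEFAULT`, `GENERAL`, and `tokenize` are introduced here for illustration only.

```python
import re

# Default definition: word characters plus soft hyphen (U+00AD),
# zero-width space (U+200B), and zero-width joiner (U+200D).
# The invisible characters are written as escapes so they can be seen.
DEFAULT = re.compile(r"[\w\u00AD\u200B\u200D]+")

# "general" definition: a run of word characters, or any single
# character that is neither a word character nor white space.
GENERAL = re.compile(r"\w+|[^\w\s]")


def tokenize(text, pattern=DEFAULT):
    """Return the tokens of `text`, in order, under the given pattern."""
    return pattern.findall(text)


sample = "(I go!)"
default_tokens = tokenize(sample)           # ['I', 'go']
general_tokens = tokenize(sample, GENERAL)  # ['(', 'I', 'go', '!', ')']

# One of the "surprises": an apostrophe splits a word under the default.
contraction = tokenize("New York didn't")   # ['New', 'York', 'didn', 't']
```

Under the default definition the sample has two tokens; under the general definition it has the five tokens listed above.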
You may customize your own <token-definition> to suit your needs. But keep in mind that TAN files were meant to be shared across fields and disciplines. You are encouraged to define tokens in a manner customary to users of the text. Specialized definitions make it less likely that your TAN file will be able to mesh well with other TAN files. Two class-2 files annotating the same class-1 file cannot be easily compared or synthesized if they use different definitions of token.
Given those caveats, consider a specialized case, where you wish to prepare your transcriptions such that certain Unicode characters precisely delimit tokens that are synonymous with a particular linguistic category, say lexeme. Say, for example, you use specialized control characters (e.g., U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER) to mark word boundaries within the text of your class 1 file. You might then create a <token-definition> like this:
<token-definition regex="[^\p{Cf}\s]+"/>
The statement defines a token as any consecutive sequence of characters that are neither white space nor format control characters (Unicode general category Cf).
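A minimal Python sketch of the same idea follows. Python's standard `re` module does not support `\p{Cf}`, so this version checks each character's Unicode general category with `unicodedata` instead of using the regular expression itself. The function names are hypothetical, and white space is limited to space, tab, newline, and carriage return, on the assumption that `\s` carries its XML Schema meaning.

```python
import unicodedata
from itertools import groupby

XML_WHITESPACE = " \t\n\r"


def is_delimiter(ch):
    """True for white space and for format characters (category Cf)."""
    return ch in XML_WHITESPACE or unicodedata.category(ch) == "Cf"


def tokenize_on_format_chars(text):
    """Split `text` into maximal runs of non-delimiter characters."""
    return ["".join(run)
            for delim, run in groupby(text, key=is_delimiter)
            if not delim]


# U+200C and U+200D are written as escapes so the boundaries are visible.
tokens = tokenize_on_format_chars("in\u200Cto the\u200Dsea")
# ['in', 'to', 'the', 'sea']
```

The third-party `regex` package does accept `\p{Cf}`, if a regular expression closer to the original is preferred.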
Such customized approaches may make the technique unwieldy or impossible to use, thereby limiting your TAN file's interoperability and utility. If you use control, formatting, or other special characters that are invisible, it is recommended that you write them as XML numeric character references, e.g., &#x200d;, so they can be seen in your file.