<head>
No matter how much one TAN format differs from another, the metadata follows the same basic structure. Anyone who receives a TAN file, whatever its class or type, is assumed to want to know, and therefore to find easily and predictably, the following:
the stable name of the file;
its version;
its sources;
other files upon which it depends or otherwise has an important relationship;
the most significant parts of the editorial history;
the linguistic or scholarly conventions that have been adopted in creating and editing the data;
the license, i.e., who holds what rights to the data, and what kind of reuse is allowed;
the persons, organizations, or entities that helped create the data, and the roles played by each.
To answer these questions completely, consistently, and predictably, the <head>, a mandatory child of the root element, follows a common pattern across all TAN formats, which keeps TAN files predictable. The TAN <head>, intended to be concise and focused, compels you to provide metadata for the data that is governed by <body>, but it does not accommodate metadata for the metadata. TAN metadata centers on the data itself and not on other things. For example, <head> requires you to name the people who helped create or edit the data, but you are not expected to tell us about them. Merely give good <IRI>s to point to authoritative sources that provide background information.[12]
In what follows we provide a general overview of the TAN <head>, focusing on its general structure and on some of the principles that affect other parts of the TAN ecosystem.
The first section of a <head> holds key information about the file as a whole. This includes <name>, perhaps one or more <desc>s, and perhaps one or more <master-location>s, which point to locations for authoritative versions. <master-location> is optional unless <to-do> (see below) is empty, in which case it is required.
Each <head> in a TAN file has a declaration section, pertaining to how the file should be used: <license> and <numerals>.
<license> stipulates the license(s) under which the persons or organizations listed in its @licensor are releasing the data.
The license applies only to the data in <body>, not to its sources. The distinction is important and helpful: it is much easier for you to decide and state the rights and license behind your own work than to speak for others. Declaring who holds what rights over your source(s) may be not only difficult but risky, and is therefore optional, best handled in a <desc> or <comment>.
When using a TAN file, you should investigate the entire chain of rights. You may find discrepancies between the license of a TAN file and that of its sources. For example, you might create a complete TAN-based lexico-morphological analysis of a 20th-century novel, and legitimately release the TAN data under a public domain license, even though the novel itself is under copyright. Users must be aware of and respect licenses, and know that the license in a TAN file may not be the license of its sources.
TAN adopts the Creative Commons licenses as its default license vocabulary. See the section called “TAN keywords for types of rights (<license>)”.
<numerals> may be used to declare whether an ambiguous numeral should be interpreted as an alphabetic numeral or a Roman numeral (default). See the entry for <numerals> as well as the section on numeration systems.
Many TAN files allow in this section <token-definition>, which specifies a definition for tokens, perhaps tailored via @src to a specific class-2 file. See the section called “Defining words and tokens” and <token-definition>.
The third major section of <head> accommodates links and references to other files. Some files are essential to processing the TAN file, while others are less important.
The two most critical types of files are marked by <inclusion> and <vocabulary>. The files pointed to by these elements should be considered constituent parts of the dependent TAN file. In the validation process, failure to access any one of them (calculated recursively) is a fatal error.
<inclusion> and <vocabulary> were developed to reduce duplication (and therefore potential error) in collections of TAN files. Many if not most TAN files are created alongside or in the context of a project, where certain data patterns are repeated. Explicit repetition from one file to the next is prone to error: changes might be made in one file but not in another, introducing version conflicts. <inclusion> and <vocabulary> provide a specialized method of inclusion that leads to cleaner, smaller files.
In general, you should first try using <vocabulary>, which points to TAN-voc files that collect vocabulary items common to the project. If that element does not do what you want, then try <inclusion>. It is normally easier to diagnose a complex set of <vocabulary>s than a complex set of <inclusion>s.
Oftentimes, from one file to the next, an editor needs to refer repeatedly to a common set of things, e.g., manuscripts, works of literature, or persons who helped edit the files.
Projects are advised to create their own <TAN-voc> files, populated with commonly used vocabulary. Once set up, the TAN-voc file must be linked to via a <vocabulary> in the <head> of each TAN file that draws from the vocabulary. Vocabulary items can then be invoked either by pointing to <name> values, or by assigning an @xml:id to a vocabulary item placed in the <head>'s <vocabulary-key>. If you draw upon <name>, you may alter capitalization, and hyphens, spaces, and underscores are treated as interchangeable. Capitalization and spelling of @xml:id, however, must be strictly followed.
Vocabulary (TAN-voc) files tend to require frequent change and expansion, so it is recommended that you depend upon only those TAN-voc files that are part of your project, and not those from a different project.
In the host file, any attribute that takes multiple IDrefs, e.g., @who, @type, @subject, may take a mixture of values that refer to numerous vocabulary items via @xml:id or <name>. But in these attributes spaces are reserved to delimit multiple values, which means that if you refer to a <name>, spaces must be replaced with the underscore or hyphen. A @which in the host file, however, can take no more than one value, so using spaces is fine.
@id and @xml:id are case-sensitive, and do not allow spaces. @which and therefore <name> are not case-sensitive, and the space, hyphen, and underscore are equivalent.
If you point to @id or @xml:id you must respect case and punctuation. If you are pointing to a <name> you can ignore case, and you should probably replace each space with an underscore (_).
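The matching rules above can be sketched in Python. This is only an illustration of the stated rules, not TAN's own validation code, and the function names normalize_name, names_match, and ids_match are hypothetical. The sketch assumes that each space, hyphen, or underscore is mapped one-for-one, without collapsing runs.

```python
import re


def normalize_name(value: str) -> str:
    """Fold case and treat space, hyphen, and underscore as equivalent."""
    # Assumption: characters are mapped one-for-one; runs are not collapsed.
    return re.sub(r"[ _-]", "_", value.strip()).lower()


def names_match(reference: str, name: str) -> bool:
    """Lenient comparison, as described for references to <name> values."""
    return normalize_name(reference) == normalize_name(name)


def ids_match(reference: str, xml_id: str) -> bool:
    """Strict comparison, as described for @id and @xml:id."""
    return reference == xml_id
```

Under these assumptions, names_match("Sermon_on_the_Mount", "sermon on the mount") holds, while ids_match("Rubr", "rubr") does not.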
TAN includes a number of standard vocabulary (TAN-voc) files for a variety of concepts commonly used in textual scholarship (see Chapter 11, Official TAN vocabularies). Vocabulary items have been defined for more than one hundred types of textual divisions, and any of these can be invoked simply by using their names (see the section called “TAN keywords for types of divisions (<div-type>)”).
<vocabulary> itself may take @which, but only to point to one of the extra TAN vocabularies listed in the section called “TAN vocabulary items for extra vocabularies (<vocabulary>)”. You cannot point to a customized TAN-voc file via @which. This restriction avoids some complexity in the validation routine. See the section called “Extra @n vocabulary” on how to use this feature.
Files pointed to by <vocabulary> are considered an essential part of any TAN file. Failure to find the target file will throw a fatal error during validation.
Whereas vocabularies do not change the host document, inclusions do. Unlike other forms of inclusion you might be familiar with, TAN inclusion is targeted at select elements, never an entire file. TAN inclusion is a two-step process.
First, a TAN file is linked to, and therefore made available for inclusion, via <inclusion>s (inside <head>). Like <vocabulary>, an <inclusion> does nothing on its own. It merely points to a file that is eligible for inclusion. No actual inclusions occur until the next step.
Second, select parts of the included file are invoked in the dependent file. To do so, insert an element X in a valid location, but with nothing but @include, with one or more values (space-delimited), each pointing to the @xml:id value of an <inclusion>. In the validation process, that element X will be replaced with all element Xs found in the inclusion file, resolved recursively, and ignoring duplications (deeply equal elements).
For example, a TAN-T file might have a <div include="poem1">. The validation routine will replace that element with every rootmost <div> in the included file called poem1.
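The second step can be sketched with Python's standard xml.etree.ElementTree module. This is a rough model of the behavior described above, not the TAN validation routine: it resolves only one level (no recursion), it ignores XML namespaces, and the helper names deep_key, rootmost, and resolve_includes are hypothetical.

```python
import xml.etree.ElementTree as ET


def deep_key(elem):
    """Hashable summary of an element, used as a stand-in for deep equality."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        (elem.text or "").strip(),
        tuple(deep_key(child) for child in elem),
    )


def rootmost(node, tag):
    """Yield the outermost descendants of `node` that carry the given tag."""
    for child in node:
        if child.tag == tag:
            yield child  # nested matches stay inside their parent
        else:
            yield from rootmost(child, tag)


def resolve_includes(host_root, inclusions):
    """Replace each element bearing @include with same-named included elements.

    `inclusions` maps an inclusion id to the parsed root of its target file.
    """
    for parent in list(host_root.iter()):
        new_children = []
        for child in list(parent):
            ids = child.get("include")
            if ids is None:
                new_children.append(child)
                continue
            seen = set()
            for inc_id in ids.split():
                for found in rootmost(inclusions[inc_id], child.tag):
                    key = deep_key(found)
                    if key not in seen:  # skip deeply equal duplicates
                        seen.add(key)
                        new_children.append(found)
        parent[:] = new_children
    return host_root
```

Given a host <body> holding <div include="poem1"/>, calling resolve_includes(host, {"poem1": included_root}) swaps that placeholder for the outermost <div>s of the included file. A missing inclusion id raises KeyError here, loosely mirroring the fatal error the text describes.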
Any host file that includes elements from another file inherits any vocabulary associated with the inclusion, and along with it @xml:id values. This may result in IDrefs pointing to two or more distinct vocabulary items, which may be a benefit or a hindrance. Be familiar with the items you are including.
TAN inclusion is very practical for texts. Textual works commonly nest inside each other. By setting up your class-1 files as a series of inclusions, you can reduce validation time, both in the file and in class-2 files that depend upon the transcriptions. See the examples subdirectory for a sample in which the Gospel of Matthew includes the Sermon on the Mount, which in turn includes the Lord's Prayer.
The inclusion technique is also especially useful for vocabulary (TAN-voc) files. A single master TAN-voc file can include other vocabulary files, each devoted to a particular type of item (e.g., one for works, one for scripta). Project files then need to link merely to the master TAN-voc file.
You can include a TAN file that itself includes other TAN files. Inclusion is recursive. In any recursive system, circularity is fatal. That is true for TAN inclusion as well, but only within the scope of specified element names. It is perfectly legal for two files to include each other, as long as they do not try to include (directly or indirectly) the same elements, or try to consult each other to resolve any vocabulary.
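The element-scoped nature of the circularity rule can be illustrated with a small Python sketch. It models only the inclusion half of the rule (not vocabulary resolution), and both the data layout and the function name find_cycle are assumptions made for illustration.

```python
def find_cycle(includes, start_file, element):
    """Return a circular chain of files for `element`, or None.

    `includes` maps (file, element name) to the files whose elements of
    that name the file pulls in.
    """
    def visit(file, trail):
        if file in trail:
            # The file is already on the current path: report the loop.
            return trail[trail.index(file):] + [file]
        for target in includes.get((file, element), ()):
            cycle = visit(target, trail + [file])
            if cycle:
                return cycle
        return None

    return visit(start_file, [])
```

With includes = {("a.xml", "div"): ["b.xml"], ("b.xml", "person"): ["a.xml"]}, the two files include each other but for different element names, so find_cycle(includes, "a.xml", "div") finds nothing. If both entries concerned "div", the same call would report a loop.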
Files pointed to by <inclusion> are considered an essential part of any TAN file. Failure to find the target file will throw a fatal error during validation.
A TAN file may point to a number of other types of files. The more that are mentioned, the richer the network. <predecessor> and <successor> point to versions of the file that precede and postdate it.
<source> is another type of related file, but it may or may not link to another file. In class-2 files <source> always points to a class-1 TAN file. In class-1 and class-3 files, <source> may point either to a file or to a scriptum (see the section called “Domain model”).
<see-also> can be used to point to any file that has some relationship to a TAN file. The required @relationship points to one or more <relationship> vocabulary items. There is no standard TAN vocabulary for relationships. Normally, when a file-to-file relationship is considered important, it becomes a full-fledged standard TAN element.
Some TAN formats allow special types of related files (e.g., <redivision> and <model> for class-1 files). See metadata descriptions under specific classes or formats.
The fourth major section of <head>, which is optional, consists of <adjustments>, which specifies changes that have been made (class 1), or should be made (class 2), to the sources.
In class-1 files, these consist of <normalization>s and <replace>s; see the section called “Normalizing transcriptions”.
Class-2 files allow <skip>, <rename>, <equate>, and <reassign> as adjustments; see the section called “Class 2 metadata (<head>)”.
<vocabulary-key>
The fifth major part of <head>, <vocabulary-key>, allows you to declare any vocabulary items specific to the file. It also allows you to take vocabulary items existing in other TAN-voc files (whether defined in <vocabulary> or standard TAN vocabulary), and assign them @xml:ids that are valid only in the current file. Anything in <vocabulary-key>, and any TAN-voc files pointed to via <vocabulary>, will override default TAN vocabulary.
These id assignments can be supplemented with <alias>es, which are used to assign an id to one or more ids. This practice resembles what text editors do when naming groups of manuscripts. Each manuscript is given a siglum, say a single lowercase Greek or Latin letter, and the manuscripts are grouped together into families, with each family given its own siglum, say an uppercase letter. If the editor wishes to indicate that a whole family of manuscripts departs from a particular reading, the family siglum is all that is needed. An <alias> works much the same way, and can be used for any vocabulary items. For example, if a textual division can be legitimately called both a rubric and a heading, you could assign rubr and hd as ids in the <vocabulary-key> to the vocabulary items for the rubric and the heading, and then insert <alias xml:id="rubrichead" idrefs="rubr hd">. Then, in that file, <div n="1" type="rubrichead"> would identify that <div> as being both a rubric and a heading.
Unlike other pointing attributes, the @idrefs of an <alias> cannot point to the <name> value of vocabulary items. It can refer only to the id values of locally defined instances of @xml:id. This restriction reduces confusion, and avoids some complexity in the resolution and validation of a TAN file.
<alias>es may recurse, as long as there is no circularity. That is, @idrefs in an <alias> may refer to any @xml:id or @id, whether it belongs to a vocabulary item or to another <alias>.
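Recursive alias resolution with a circularity check can be sketched as follows. This is a hypothetical Python model, not TAN's resolver: the aliases dictionary simply maps each alias id to its space-delimited @idrefs value, and the function name resolve_alias is an assumption.

```python
def resolve_alias(ref, aliases, _trail=()):
    """Expand an id into the vocabulary-item ids it ultimately stands for."""
    if ref in _trail:
        raise ValueError("circular alias: " + " -> ".join(_trail + (ref,)))
    if ref not in aliases:
        return [ref]  # a plain vocabulary-item id
    resolved = []
    for target in aliases[ref].split():
        for item in resolve_alias(target, aliases, _trail + (ref,)):
            if item not in resolved:  # keep first occurrence only
                resolved.append(item)
    return resolved
```

With aliases = {"rubrichead": "rubr hd"}, resolve_alias("rubrichead", aliases) yields ["rubr", "hd"]. A pair of aliases that point at each other raises ValueError instead of looping forever.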
In most cases <alias> should refer to items of the same type. In a few situations mixed groups do not pose a problem, for example mixing <person>s, <algorithm>s, and <organization>s. TAN validation will indicate whether mixed typology introduces errors.
Because @xml:id may not contain certain types of characters, such as common punctuation marks, and because <alias> must be able to coin unusual ids (especially for grammatical features), @id may be used instead of @xml:id in <alias>.
The sixth section of a <head> declares who is responsible for the file. It consists of a <file-resp> and one or more <resp>s. The persons, organizations, or algorithms pointed to in <file-resp> must include at least one who has a tag URN whose namespace matches the namespace in the tag URN of the root element's @id.
This requirement helps ensure that each TAN file is associated with the person or persons who are or were responsible for the file.
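The namespace check can be sketched in Python. The sketch assumes that the "namespace" of a tag URN is its tag:authority,date prefix (the tagging entity of the tag URI scheme); that reading, the regular expression, and the function names are assumptions, not TAN's actual implementation.

```python
import re

# Assumption: the namespace is "tag:" + authority + "," + date.
TAG_URN = re.compile(r"^(tag:[^,:]+,\d{4}(?:-\d{2}(?:-\d{2})?)?):")


def tag_namespace(iri):
    """Return the 'tag:authority,date' prefix of a tag URN, or None."""
    match = TAG_URN.match(iri)
    return match.group(1) if match else None


def file_resp_is_valid(doc_id, agent_iris):
    """True if at least one agent IRI shares the namespace of the file's @id.

    `agent_iris` holds one list of IRIs per agent named by <file-resp>.
    """
    namespace = tag_namespace(doc_id)
    return namespace is not None and any(
        tag_namespace(iri) == namespace
        for iris in agent_iris
        for iri in iris
    )
```

For a file whose @id is tag:example.org,2024:tan-t/poem1, an agent with the IRI tag:example.org,2024:self would satisfy the check, whereas an agent known only by IRIs in another namespace would not.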
<person>s so identified by <file-resp> are called primary agents, and are bound to the global variable $primary-agents. If a claim is made in a TAN file, and no @claimant is explicitly declared, it is assumed that the $primary-agents are making the claim.
The change log, the seventh section of the <head>, consists of one or more <change>s, which provide a partial history of the file. The entire history is calculated from every attribute that has a date or dateTime value, and can be fetched via the function tan:get-doc-history() or the global variable $doc-history.
The change log is an effective way to communicate with those who might use your files. In all likelihood, a user will download a local copy from the master location. You might later make changes or updates to your master copy. Anyone depending upon a copy will be warned, during Schematron validation, of each <change> that postdates the value of their @accessed-when. If you have introduced an important or disruptive change, you can mark your <change> with @flag, which allows the following values: warning (default value), error, info, fatal. By marking a change as info, you lower the level of a change's importance; error raises the level. The value fatal will halt the validation process in the dependent file altogether.
If you receive change messages during validation, and you want to stop them, merely update the value of @accessed-when to the current date.
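The filtering behavior described above can be modeled in a few lines of Python. This is a simplified sketch, not the Schematron logic itself: it assumes plain ISO dates (no time component), and the dictionary keys and the function name change_messages are hypothetical.

```python
from datetime import date


def change_messages(changes, accessed_when):
    """Return (flag, text) pairs for changes made after the copy was accessed.

    `changes` is a list of dicts with "when" (ISO date), "text", and an
    optional "flag"; a missing flag falls back to the default, "warning".
    """
    cutoff = date.fromisoformat(accessed_when)
    return [
        (change.get("flag", "warning"), change["text"])
        for change in changes
        if date.fromisoformat(change["when"]) > cutoff
    ]
```

Updating accessed_when to the date of the latest change empties the list, which corresponds to the advice above for silencing the messages.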
The last section of a <head> lists all pending tasks that have yet to be applied to a file. These are itemized as a list of <comment>s in <to-do>. A file with an empty <to-do> is assumed to be no longer in progress, so a <master-location> must be provided.
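The dependency between <to-do> and <master-location> can be expressed as a small check. The sketch below uses Python's standard xml.etree.ElementTree, ignores XML namespaces, and treats "empty" as "having no child elements"; the function name needs_master_location is hypothetical, and the real rule is enforced by TAN validation rather than by code like this.

```python
import xml.etree.ElementTree as ET


def needs_master_location(head):
    """True if <to-do> is empty but <head> has no <master-location>."""
    to_do = head.find("to-do")
    finished = to_do is not None and len(to_do) == 0
    return finished and head.find("master-location") is None
```

A <head> parsed with ET.fromstring that has an empty <to-do> and no <master-location> makes the function return True, flagging the file as incomplete under the rule above.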
Like the change log, the <to-do> effectively communicates cautionary notes to those who might use your files. Anyone depending upon a copy will be warned, during Schematron validation, of each item in the list. The report is not dependent upon when the file was last consulted (@accessed-when), because this is a collection of standing, unresolved issues.
One benefit of <to-do> is that you can release your material before it is finished. Other users will have fair warning about what is imperfect or incomplete.
[12] The principles above explain why the TEI extension of TAN requires two heads, one for TEI and the other for TAN. The <teiHeader> supports the creation of metadata that has little or no relevance to the content of <body>, has its own unique structure, requires very few metadata, and is not designed to incorporate IRIs. Although <teiHeader> and TAN's <head> overlap in some respects, they cannot be mapped onto each other. Each has a different purpose, so both must be retained.