<head>
No matter how much one TAN format differs from another, the metadata follows the same basic structure. Anyone who receives a TAN file, whatever its class or type, is assumed to want to find the following easily and predictably:
the stable name of the file;
its version;
its sources;
other files upon which it depends or otherwise has an important relationship;
the most significant moments in the editorial history;
the linguistic or scholarly conventions that have been adopted in creating and editing the data;
the license, i.e., who holds what rights to the data, and what kind of reuse is allowed;
the persons, organizations, or entities that helped create the data, and the roles played by each.
To answer these questions completely, consistently, and predictably, the <head>, a mandatory child of the root element, follows a common pattern across all TAN formats, which makes work across large numbers and types of TAN files predictable. The TAN <head>, intended to be concise and focused, compels you to provide metadata for the data governed by <body>, but it does not accommodate metadata for the metadata. That is, your metadata should focus on the data itself and not on other things. For example, <head> requires you to name the people who helped create or edit the data, but you are not expected to tell us about them. Merely give good <IRI>s that point to authoritative sources of background information.
Note: The principles above explain why the TEI extension of TAN requires two heads, one for TEI and the other for TAN.
In what follows we provide a general overview of the TAN <head>, focusing on its general structure and on some of the principles that affect other parts of the TAN ecosystem.
The first section of a <head> holds key information about the file as a whole. This includes <name>, perhaps one or more <desc>s, and perhaps one or more <master-location>s, which point to locations for authoritative versions. <master-location> is optional unless <to-do> (see below) is empty, in which case at least one must be provided.
Each <head> in a TAN file has a declaration section that pertains to how the file should be used: <license> and <numerals>.
<license> stipulates the license(s) under which the persons listed in its @licensor are releasing the data. The license applies only to the data in <body>, not to its sources. The distinction is important and helpful. It is much easier to decide and state the rights and license behind your own work than to do so for the work of others. Declaring who holds what rights over your source(s) may be not only difficult but risky, and is therefore optional, best handled in a <desc> or <comment>.
When using a TAN file, you should investigate the entire chain of rights. You may find discrepancies between the license of a TAN file and that of its sources. For example, you might create a thorough lexico-morphological analysis of a 20th-century novel, and legitimately release the TAN data under a public domain license, even though the novel itself is under copyright. Users must be aware of and respect licenses, and know that the license in a TAN file may not be the license of its sources.
TAN adopts the Creative Commons licenses as its default license vocabulary. See the section called “TAN keywords for types of rights (<license>)”.
<numerals> may be used to declare whether an ambiguous numeral should be interpreted as an alphabetic numeral or a Roman numeral (default). See the entry for <numerals> as well as the section on numeration systems.
Many TAN files also allow <token-definition> in this section, which specifies a definition for tokens, perhaps tailored via @src to a specific class-2 file. See the section called “Defining Words and Tokens” and <token-definition>.
The third major section of <head> accommodates links and references to other files. Some files are essential to processing the TAN file, while others are less important.
The two most critical types of files are marked by <inclusion> and <vocabulary>. The files pointed to by these elements should be considered constituent parts of the dependent TAN file. In the validation process, failure to access one will be treated as a fatal error.
<inclusion> and <vocabulary> were developed to reduce duplication (and therefore potential error) in collections of TAN files. Many if not most TAN files are created alongside or in the context of a project, where certain data patterns are repeated. Explicit repetition from one file to the next makes them prone to error. Changes might be made in one file but not in another, introducing version conflicts. <inclusion> and <vocabulary> provide a specialized method of inclusion that leads to cleaner, smaller files.
In general, you should first try using <vocabulary>. If that element does not do what you want, then try <inclusion>. It is normally easier to diagnose a complex set of <vocabulary>s than a complex set of <inclusion>s.
Oftentimes, from one file to the next, an editor needs to refer repeatedly to a common set of things, e.g., manuscripts, works of literature, or persons who helped edit the files.
Projects are advised to create their own <TAN-voc> files populated with commonly used vocabulary. Once set up, the TAN-voc file must be linked to via a <vocabulary> in the <head> of other TAN files. Vocabulary items can then be invoked either by pointing to <name> values, or by assigning an @xml:id to a vocabulary item placed in the <head>'s <vocabulary-key>. If you draw upon <name>, you may alter capitalization, and hyphens, spaces, and underscores are treated as interchangeable. Capitalization and spelling of @xml:id, however, must be strictly followed.
Vocabulary (TAN-voc) files tend to require frequent change and expansion, so it is recommended that you depend upon only those TAN-voc files that are part of your project, and not those from a different project.
In the host file, any attribute that takes multiple IDrefs, e.g., @who, @type, @subject, may mix references to vocabulary items via @xml:id or <name>. Because spaces in such attributes are reserved to delimit multiple values, any space in a <name> must be replaced with an underscore or hyphen. A @which in the host file, however, can take no more than one value, so using spaces is fine.
@id and @xml:id are case-sensitive, and do not allow spaces. @which and therefore <name> are not case-sensitive, and the space, hyphen, and underscore are equivalent.
If you point to @id or @xml:id you must respect case and punctuation. If you are pointing to a <name> you can ignore case, and you should normally replace each space with an underscore (_).
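The matching rules above can be sketched in Python. This is only an illustration of the stated rules, not the TAN validation code: the function names are hypothetical, and the sketch assumes that space, hyphen, and underscore are compared one for one after lowercasing.

```python
def normalize_name(value: str) -> str:
    """Fold case and treat hyphen and underscore as a space (assumed rule)."""
    return value.lower().translate(str.maketrans("_-", "  "))


def name_matches(reference: str, name: str) -> bool:
    """<name> comparison: case-insensitive; space, hyphen, underscore equivalent."""
    return normalize_name(reference) == normalize_name(name)


def id_matches(reference: str, xml_id: str) -> bool:
    """@id / @xml:id comparison: exact, so case and punctuation must agree."""
    return reference == xml_id


def split_refs(attribute_value: str) -> list[str]:
    """Multi-value attributes use spaces as delimiters, so a <name> with
    spaces must already have been written with underscores or hyphens."""
    return attribute_value.split()
```

For instance, `split_refs("Sermon_on_the_Mount rubr")` yields two references; the first matches a <name> of "sermon on the mount" under `name_matches`, while `id_matches("Rubr", "rubr")` fails because ids are case-sensitive.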
TAN includes a number of standard vocabulary (TAN-voc) files for a variety of concepts commonly used in textual scholarship (see Chapter 10, Official TAN vocabularies). For example, there are more than one hundred types of textual divisions that can be invoked simply by using their names (see the section called “TAN keywords for types of divisions (<div-type>)”).
<vocabulary> itself may take @which, but only to point to one of the extra TAN vocabularies listed in the section called “TAN vocabulary items for extra vocabularies (<vocabulary>)”. This restriction avoids some complexity in the validation routine. See the section called “Extra @n vocabulary” on how to use this feature.
Files pointed to by <vocabulary> are considered an essential part of any TAN file. Failure to find the target file will throw a fatal error during validation.
Whereas vocabularies do not change the host document, inclusions do. Unlike other forms of inclusion you might be familiar with, TAN inclusion is targeted at select elements, never an entire file. TAN inclusion is a two-step process.
First, a TAN file is linked to, and therefore made available for inclusion, via one or more <inclusion>s (inside <head>). Like <vocabulary>, an <inclusion> does nothing on its own. It merely points to a file that is eligible for inclusion. No actual inclusion occurs until the next step.
Second, select parts of the included file are invoked in the dependent file. To do so, insert an element X in a valid location, giving it nothing but @include, with one or more space-delimited values pointing to the @xml:id values of the <inclusion>s desired. In the validation process, that element will be replaced with every element X found in the inclusion file, resolved recursively, and ignoring duplicates (deeply equal elements).
For example, a TAN-T file might have a <div include="poem1">. The validation routine will replace that element with every rootmost <div> in the included file called poem1.
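The replacement step can be approximated with a short Python sketch. It is a simplification under stated assumptions, not the TAN validation routine: the function names are hypothetical, XML namespaces are ignored, includes inside the included file are not themselves resolved, and serialized text stands in for true deep equality.

```python
import copy
import xml.etree.ElementTree as ET


def rootmost(root, tag):
    """Collect elements named `tag` that are not nested inside another `tag`."""
    found = []

    def walk(element):
        for child in element:
            if child.tag == tag:
                found.append(child)  # stop here: deeper matches are not rootmost
            else:
                walk(child)

    walk(root)
    return found


def expand_includes(host_root, inclusions):
    """Replace each element carrying @include with the rootmost same-named
    elements of the referenced files. `inclusions` maps an <inclusion>
    xml:id to the parsed root of the file it points to."""
    for parent in list(host_root.iter()):
        # Work backwards so earlier child positions stay valid after edits.
        for index, child in reversed(list(enumerate(parent))):
            refs = child.get("include")
            if refs is None:
                continue
            replacements, seen = [], set()
            for ref in refs.split():
                for element in rootmost(inclusions[ref], child.tag):
                    key = ET.tostring(element)  # crude stand-in for deep equality
                    if key not in seen:
                        seen.add(key)
                        replacements.append(copy.deepcopy(element))
            parent.remove(child)
            for offset, element in enumerate(replacements):
                parent.insert(index + offset, element)
    return host_root
```

Given a host `<body>` containing `<div include="poem1"/>` and an included file whose body holds several top-level `<div>`s, the host ends up with copies of those `<div>`s in place of the placeholder, with exact duplicates dropped.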
Any host file that includes elements from another file inherits any associated vocabulary, and along with it @xml:id values. This may result in errors if there are any resultant conflicts in IDrefs.
TAN inclusion is very practical for texts. Textual works commonly nest inside each other. By setting up your class-1 files as a series of inclusions, you can reduce validation time, both in the file and in class-2 files that depend upon the transcriptions. See the examples subdirectory for a case of the Gospels including the Sermon on the Mount including the Lord's Prayer.
The inclusion technique is also especially useful for vocabulary (TAN-voc) files. A single master TAN-voc file can include other vocabulary files, each devoted to a particular type of item (e.g., one for works, one for scripta). Project files then need to link merely to the master TAN-voc file.
You can include a TAN file that itself includes other TAN files. Inclusion is recursive. In any recursive system, circularity is fatal. That is true for TAN inclusion as well, but only within the scope of specified element names. It is perfectly legal for two files to include each other, as long as they do not try to include (directly or indirectly) the same elements, or try to consult each other to resolve any vocabulary.
Files pointed to by <inclusion> are considered an essential part of any TAN file. Failure to find the target file will throw a fatal error during validation.
Other files can be specified. The more that are mentioned, the richer the network. <predecessor> and <successor> point to versions of the file that precede and postdate it.
<source> is another type of related file, but it may or may not link to another file. In class-2 files <source> always points to a class-1 TAN file. In class-1 and class-3 files, <source> may point either to a file or to a scriptum (see the section called “Domain model”).
<see-also> can be used to point to any file that has some relationship to a TAN file. The required @relationship points to one or more <relationship> vocabulary items. There is no standard TAN vocabulary for relationships. Normally, when a file-to-file relationship is considered important, it becomes a full-fledged element.
Some TAN formats allow special types of related files (e.g., <redivision> and <model> for class-1 files). See metadata descriptions under specific classes or formats.
The fourth major section of <head>, which is optional, consists of <adjustments>, which specifies changes that have been made (class 1), or should be made (class 2), to the sources.
In class-1 files, these consist of <normalization>s and <replace>s; see the section called “Normalizing Transcriptions”.
Class-2 files allow <skip>, <rename>, <equate>, and <reassign> as adjustments; see the section called “Class 2 Metadata (<head>)”.
<vocabulary-key>
The fifth major part of <head>, <vocabulary-key>, allows you to declare any specific vocabulary items relevant for the file. It also allows you to take vocabulary items existing in other TAN-voc files (whether defined in <vocabulary> or standard TAN vocabulary), and assign them @xml:ids that are valid only in the current file. Anything in <vocabulary-key> will overwrite default TAN vocabulary, but not any TAN-voc files pointed to via <vocabulary>.
These id assignments can be supplemented with <alias>es, which are used to assign an id to one or more ids. This practice resembles what text editors do when naming groups of manuscripts. Each manuscript is given a siglum, say a single lowercase Greek or Latin letter, and the manuscripts are grouped together into families, with each family given its own siglum, say an uppercase letter. If the editor wishes to indicate that a whole family of manuscripts departs from a particular reading, the family siglum is all that is needed. An <alias> works much the same way, and can be used for any vocabulary items. For example, if a textual division can be legitimately called both a rubric and a heading, you could assign rubr and hd as ids in the <vocabulary-key> to the vocabulary items for the rubric and the heading, and then insert <alias xml:id="rubrichead" idrefs="rubr hd">. Then, in that file, <div n="1" type="rubrichead"> would identify that <div> as being both a rubric and a heading.
Unlike other similar attributes, the @idrefs of an <alias> cannot point to the <name> value of vocabulary items. It can point only to locally defined @xml:id values. This restriction reduces confusion, and avoids some complexity in the resolution and validation of a TAN file.
<alias>es may recurse, as long as there is no circularity. That is, @idrefs in an <alias> may refer to any @xml:id or @id, whether it belongs to a vocabulary item or to another <alias>.
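Recursive alias resolution with a circularity check can be sketched as follows. This is a hedged illustration rather than the TAN resolution routine: the function name and the dictionary representation are hypothetical, and the ids headings and ttl in the usage note are invented for the example (rubr, hd, and rubrichead come from the text above).

```python
def resolve_alias(ref, aliases, _trail=()):
    """Expand `ref` into the non-alias ids it ultimately stands for.

    `aliases` maps an alias id to the list of ids in its @idrefs.
    Raises ValueError if an alias refers, directly or indirectly, to itself.
    """
    if ref in _trail:
        raise ValueError("circular alias: " + " -> ".join(_trail + (ref,)))
    if ref not in aliases:
        return [ref]  # an ordinary id, not an alias
    resolved = []
    for target in aliases[ref]:
        for item in resolve_alias(target, aliases, _trail + (ref,)):
            if item not in resolved:  # keep first occurrence, preserve order
                resolved.append(item)
    return resolved
```

With `aliases = {"rubrichead": ["rubr", "hd"], "headings": ["rubrichead", "ttl"]}`, resolving `"headings"` walks through `rubrichead` to reach the three underlying ids, whereas a pair of aliases that point at each other raises an error instead of looping.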
In most cases <alias> should refer to items of the same type. In a few situations mixed groups do not pose a problem, for example mixing <person>s, <algorithm>s, and <organization>s. TAN validation will indicate whether mixed typology introduces errors.
Because @xml:id may not contain certain types of characters, such as common punctuation marks, and because <alias> must be able to coin unusual ids (especially for grammatical features), @id may be used instead of @xml:id in <alias>.
The sixth section of a <head> declares who is responsible for the file. It consists of a <file-resp> and one or more <resp>s. The persons, organizations, or algorithms pointed to in <file-resp> must include at least one who has a tag URN whose namespace matches the namespace in the tag URN of the root element's @id.
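The namespace requirement can be illustrated with a small Python check. This is a sketch under assumptions, not the TAN validation rule itself: it takes the "namespace" of a tag URN to be the scheme plus the tagging entity (authority and date, as in RFC 4151), and the function names and example identifiers are hypothetical.

```python
def tag_namespace(iri):
    """Return 'tag:authority,date' for a tag URN, or None for anything else."""
    parts = iri.split(":", 2)
    if len(parts) < 3 or parts[0] != "tag":
        return None
    return parts[0] + ":" + parts[1]


def file_resp_matches(root_id, agent_iris):
    """True if at least one agent IRI shares the tag namespace of the file's @id."""
    namespace = tag_namespace(root_id)
    return namespace is not None and any(
        tag_namespace(iri) == namespace for iri in agent_iris
    )
```

Under these assumptions, a file whose @id is `tag:example.com,2014:tan-t:sample` passes if one of the agents named by <file-resp> has an IRI such as `tag:example.com,2014:self`, and fails if every agent IRI belongs to a different namespace.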
This requirement strengthens the effort to make sure that each TAN file is associated with the person or persons who are or were responsible for the file. <person>s so identified by <file-resp> are called primary agents, and are bound to the global variable $primary-agents. If a claim is made in a TAN file, and no @claimant is explicitly declared, it is assumed that the $primary-agents are making the claim.
The change log, the seventh section of the <head>, consists of one or more <change>s, which provide a partial history of the file. The entire history is calculated from every attribute that has a date or dateTime value, and can be fetched via the function tan:get-doc-history() or the global variable $doc-history.
The change log is an effective way to communicate with those who might use your files. In all likelihood, a user will download a local copy from the master location. You might later make changes or updates to your master copy. Anyone depending upon a copy will be warned, during Schematron validation, of each <change> that postdates the value of their @accessed-when. If you have introduced an important or disruptive change, you can mark your <change> with @flag, which allows the following values: warning (default value), error, info, fatal. Marking a change as info lowers its level of importance; error raises it. The value fatal will halt the validation process in the dependent file altogether.
If you receive change messages during validation, and you want to stop them, merely update the value of @accessed-when to the current date.
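The reporting logic described above can be sketched in Python. It is a rough model, not the Schematron implementation: the dictionary keys `when`, `flag`, and `note` are illustrative stand-ins for whatever a <change> actually carries, and the function names are hypothetical.

```python
from datetime import date

LEVELS = ("info", "warning", "error", "fatal")


def change_reports(changes, accessed_when):
    """Return (level, note) pairs for every change dated after `accessed_when`.

    Each change is a dict with a `when` date, a `note`, and an optional
    `flag`; a missing flag is treated as the default level, warning.
    """
    reports = []
    for change in changes:
        if change["when"] > accessed_when:
            level = change.get("flag", "warning")
            if level not in LEVELS:
                raise ValueError(f"unknown flag: {level!r}")
            reports.append((level, change["note"]))
    return reports


def halts_validation(reports):
    """A single fatal change stops validation of the dependent file."""
    return any(level == "fatal" for level, _ in reports)
```

Moving `accessed_when` forward to the current date makes `change_reports` return an empty list, which mirrors the advice above for silencing change messages once they have been reviewed.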
The last section of a <head> lists all pending tasks that still need to be carried out on a file. These are itemized as a list of <comment>s in <to-do>. A file with an empty <to-do> is assumed to be no longer in progress, so a <master-location> must be provided.
Like the change log, the <to-do> effectively communicates cautionary notes to those who might use your files. Anyone depending upon a copy will be warned, during Schematron validation, of each item in the list. The report does not depend upon when the file was last consulted (@accessed-when), because these are standing, unresolved issues.
One benefit of <to-do> is that a clear account of what remains to be done will encourage people to release their material earlier than they otherwise would, because other users will have fair warning about what is imperfect or incomplete.