All TAN-compliant files, no matter the type or class, follow a common basic structure: (1) a prolog, normally with two processing instruction nodes; (2) a root element; and (3) a head, a body, and an optional teiHeader and tail.
Prolog and processing instruction nodes: The standard prolog of every XML file should begin: <?xml version="1.0" encoding="UTF-8"?> [10]
After that come two processing instructions specifying the two schema files required for validation:
<?xml-model href="[PATH]/[ROOT-ELEMENT-NAME].rn[g OR c]"?>
<?xml-model href="[PATH]/TAN.sch"?>
The first processing instruction node points to the RELAX-NG schema that declares the major, structural rules. The second points to the finely tuned rules, written in Schematron. Both processing instructions are required, except in systems where those processing instructions are implicitly understood (e.g., an Oxygen project or framework). [PATH] represents the pathname to the schema file, whether local or on a server, and [ROOT-ELEMENT-NAME] stands for the name of the file's root element (the element that is the ancestor of all other elements in the document and the descendant of none). It is your choice whether you use .rnc or .rng as the extension for the RELAX-NG schema. The former is the compact syntax and the latter, the XML format. They are equivalent. The schemas are written initially in the compact syntax, then converted to the XML format.
TAN files permit three different levels of Schematron validation: terse, normal, and verbose. A phase may be specified with a pseudoattribute phase in the prolog, e.g., <?xml-model href="TAN.sch" phase="verbose"?>. But it is customary not to specify the phase, since most users will want to pick the level of validation desired at a given time. Verbose takes the longest time and provides the most feedback; terse takes the shortest time and provides the least. Some files, however, will not show any difference in results from one phase to the next. For more on validation, see the section called “TAN validation”.
Root element: The name of the root element identifies the type of TAN file:
Table 4.1. Root TAN elements
Root element name | Type of data | TAN class |
---|---|---|
<TAN-T> | plain text transcriptions | 1 |
<TEI> | TEI transcriptions | 1 |
<TAN-A> | division-based alignments and annotations | 2 |
<TAN-A-tok> | token-based alignments | 2 |
<TAN-A-lm> | lexico-morphological annotations | 2 |
<TAN-mor> | part of speech / morphology patterns | 3 |
<TAN-voc> | glossaries | 3 |
<collection> [11] | catalog of TAN files | 3 |
Each root element takes a mandatory @id and @TAN-version. On @id, see below. @TAN-version must be 2021, the current version of TAN.
All TAN elements fall under the namespace tag:textalign.net,2015:ns. In most cases, the namespace is declared in the root element. (The only exceptions are TAN-TEI transcription files, which take as a default namespace http://www.tei-c.org/ns/1.0 everywhere but in /TEI/head, which takes the TAN namespace.) For more about namespaces, see the section called “Namespaces”.
Root element children: Most root elements take two mandatory children: <head> and <body>, the latter containing data and the former, metadata (data about the data). Root elements of TAN-TEI files take three children: <teiHeader>, <head>, and <text>. The apparent duplication of a head element is necessary: the <teiHeader> does not satisfy TAN metadata requirements, and the TAN header does not try to do what the teiHeader does. See the section called “Transcriptions using the Text Encoding Initiative (<TEI>)”.
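The three-child layout of a TAN-TEI file, and the switch of namespace on <head>, might look roughly like the sketch below. It is a skeleton for illustration, not a valid file: the @id value is a hypothetical placeholder and the contents of every element are omitted.

```xml
<!-- The @id value is a hypothetical placeholder; element contents are omitted. -->
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     id="tag:example.com,2014:tei.sample"
     TAN-version="2021">
   <teiHeader>
      <!-- TEI metadata, in the TEI namespace -->
   </teiHeader>
   <head xmlns="tag:textalign.net,2015:ns">
      <!-- TAN metadata; this element and its descendants are in the TAN namespace -->
   </head>
   <text>
      <!-- the transcription, back in the TEI namespace -->
   </text>
</TEI>
```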
All TAN files may take one final optional child, <tail>, a private use element that allows any well-formed XML. It was introduced initially to experiment with methods of improving the efficiency of validation and applications, but it can be used for a variety of tasks or applications. Nothing in a TAN file should be dependent upon the <tail>. That is, if you are editing a TAN file and you add a <tail>, assume that it will be disregarded by other users. Similarly, you may delete any TAN file's <tail> without consequence.
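Putting the prolog, root element, and root element children together, a plain text transcription might be laid out roughly as follows. This is a skeleton only and would not validate as it stands: the schema path ../schemas/ and the tag URN in @id are hypothetical, and the required contents of <head> and <body> are left out.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="../schemas/TAN-T.rnc"?>
<?xml-model href="../schemas/TAN.sch"?>
<!-- The schema path and the @id value are hypothetical placeholders. -->
<TAN-T xmlns="tag:textalign.net,2015:ns"
       id="tag:example.com,2014:tan-t.sample"
       TAN-version="2021">
   <head>
      <!-- metadata about the transcription (omitted) -->
   </head>
   <body>
      <!-- the transcription itself (omitted) -->
   </body>
   <tail>
      <!-- optional, private use: any well-formed XML -->
   </tail>
</TAN-T>
```

Some validation tools may expect further pseudoattributes on the xml-model instructions (for example, to state the schema language); the href-only form here simply follows the pattern given above.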
@id
Every TAN file requires in its root element an @id, which must take the form of a tag URN (see the section called “Tag URNs” for syntax). The file's @id is the primary way other TAN files will refer to it, and it may be used in RDFa, JSON-LD, and linked open data (see the section called “Identifiers and their use (IRIs, URIs, URLs, URNs, UUIDs)”).
A tag URN begins with a namespace component, and concludes with the identifying string. The namespace of @id must match at least one other tag URN namespace from the <IRI> of a <person> identified by <file-resp>. See the section called “Responsibility”.
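The namespace-matching rule can be pictured with a fragment like the one below. It is a sketch under several assumptions, not a pattern taken from the schema: both tag URNs are invented, and the @who attribute, the xml:id link, and the placement of <person> directly inside <head> are guesses made only to show the two namespaces side by side. Consult the section on responsibility for the actual markup.

```xml
<!-- Fragment for illustration only. The URNs, the @who / xml:id link, and the
     position of <person> inside <head> are assumptions, not taken from the schema. -->
<TAN-T xmlns="tag:textalign.net,2015:ns"
       id="tag:example.com,2014:tan-t.sample"
       TAN-version="2021">
   <head>
      <!-- other metadata omitted -->
      <file-resp who="jdoe"/>
      <person xml:id="jdoe">
         <!-- shares the namespace tag:example.com,2014 with the file's @id -->
         <IRI>tag:example.com,2014:agent:jdoe</IRI>
      </person>
   </head>
   <body>
      <!-- omitted -->
   </body>
</TAN-T>
```

Here the file's @id and the person's <IRI> both begin with tag:example.com,2014, which is the kind of match the rule asks for.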
In choosing a value for @id you might imitate the filename, but this is normally not a good idea, since files are frequently renamed, often with good reason. A TAN file's @id should not be changed, especially after public release. The name should remain permanent and stable, even if flaws in the name are recognized.
On occasion during editing, it will become clear that revisions are so deep that the file is altogether a different kind of thing. If a previous version has been published, then coining a new @id is advised, to make a clean break. You may document the connection by supplying <predecessor>, which establishes a line of ancestry.
If you take someone else's data and alter it, then you should not change the @id. To ensure that you are credited with any revisions you make to the file (if you are allowed; see <license>), you should add yourself as a <person> and then document your alterations through <change> or @ed-when and @ed-who. You might also add a <predecessor> element, pointing to the previous version of the file.
The @id is the only file-specific metadatum positioned outside <head>. It is placed as rootward in the document as possible to make clear that it names the entire document.
The version of a TAN file is identified by the most recent date in a file's @when, @ed-when, and @accessed-when.
Whenever you change a TAN file that has already been published, provide at least an edit stamp (the section called “Edit stamp”) in the part of the file you changed, or add a new <comment> or <change>, so that anyone validating a TAN file dependent upon yours will be warned that changes have been made. The user may then either continue to process the file (the changes may be minor or inconsequential) or pause and see if anything on their end needs to be changed.
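As a rough illustration of dating changes, the fragment below records one revision with <change> and a later one with an edit stamp. Everything specific in it is hypothetical: the dates, the jdoe identifier, the @who attribute on <change>, and the <div> markup are assumptions made for the example, and required metadata is omitted.

```xml
<!-- Fragment for illustration only. Dates, the who / ed-who values, and the
     <div> attributes are hypothetical; required metadata is omitted. -->
<TAN-T xmlns="tag:textalign.net,2015:ns"
       id="tag:example.com,2014:tan-t.sample"
       TAN-version="2021">
   <head>
      <change when="2021-03-01" who="jdoe">Corrected section 1 against the source.</change>
   </head>
   <body>
      <div type="section" n="1">...</div>
      <!-- edit stamp on the part that changed after publication -->
      <div type="section" n="2" ed-when="2021-06-15" ed-who="jdoe">...</div>
   </body>
</TAN-T>
```

Of the dates shown, 2021-06-15 is the most recent, so it would be the one that identifies the version of the file.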
[10] XML version 1.1 is a permissible alternative, and encoding="UTF-8" is optional.
[11] <collection> is provided here only to complete the table. None of the material in this chapter applies to this special class 3 format. See the section called “TAN Catalog Files (collection)”.