Overall Structure

All TAN-compliant files, no matter the type or class, follow a common basic structure: (1) a prolog with at least two processing instruction nodes; (2) a root element; and (3) a head, a body, and an optional teiHeader and tail.

Prolog and processing instruction nodes: The standard XML declaration must begin the file: <?xml version="1.0" encoding="UTF-8"?>. After that come two processing instructions specifying the two schema files required for validation.

The first processing instruction node points to the RELAX-NG schema that declares the major, structural rules. The second points to the finely tuned rules, written in Schematron. Both processing instructions are required. In the href value of each instruction, [PATH] represents the pathname to the schema file, whether local or on a server, and [ROOT-ELEMENT-NAME] stands for the name of the root element (the element that is the ancestor of all other elements in the document and the descendant of none). It is your choice whether you use .rnc or .rng as the extension for the RELAX-NG schema. The former is the compact syntax and the latter, the XML format. They are equivalent. The schemas are written primarily in the compact syntax, then converted to the XML format.
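
As an illustration, the prolog of a TAN-T file might look like the sketch below. The directory ../schemas is a hypothetical location, and the type and schematypens pseudoattributes are the values conventionally used with xml-model for RELAX-NG compact syntax and Schematron; they are assumptions here, not requirements stated in this section.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical path: replace ../schemas with the actual location of the TAN schemas -->
<!-- First instruction: RELAX-NG schema (structural rules), compact syntax -->
<?xml-model href="../schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
<!-- Second instruction: Schematron schema (finely tuned rules) -->
<?xml-model href="../schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<!-- The root element, here <TAN-T>, follows the prolog -->
```

If the .rng form is preferred, the first href would end in TAN-T.rng instead, and the type pseudoattribute would presumably change to application/xml.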

TAN files admit three different levels of validation: terse, normal, and verbose. A phase may be specified with the pseudoattribute phase in the Schematron processing instruction of the prolog, e.g., <?xml-model href="TAN-A-div.sch" phase="verbose"?>. But it is customary not to specify the phase, since users will often wish to change the level of validation. Verbose validation takes the longest and provides the most feedback; terse validation is the quickest and provides the least.

Root element: The name of the root element identifies the type of TAN file:

Table 4.1. Root TAN elements

Root element name   Type of data                            TAN class
<TAN-T>             plain text transcriptions               1
<TEI>               TEI transcriptions                      1
<TAN-A-tok>         token-based alignments                  2
<TAN-A-div>         division-based alignments               2
<TAN-A-lm>          lexico-morphological analysis           2
<TAN-mor>           part of speech / morphology patterns    3
<collection>        catalog of TAN files                    3


<collection> is provided here only to complete the table. None of the material in this chapter applies to this special class 3 format. See the section called “TAN Catalog Files (collection)”.

Each root element takes a mandatory @id and @TAN-version. All TAN elements take the namespace tag:textalign.net,2015:ns. In most cases, this value is placed in the root element. (The only exceptions are TAN-TEI transcription files, which take as a default namespace http://www.tei-c.org/ns/1.0 everywhere but in /TEI/head, which takes the TAN namespace.) For more about namespaces, see the section called “Namespaces”.
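
A minimal sketch of a root element for a plain text transcription follows, with the prolog omitted for brevity. The @id value is a hypothetical tag URN built on example.com; only the namespace and the @TAN-version value come from these guidelines.

```xml
<!-- The TAN namespace is declared once, on the root element -->
<TAN-T xmlns="tag:textalign.net,2015:ns"
       id="tag:example.com,2014:sample-transcription"
       TAN-version="2018">
   <!-- child elements go here -->
</TAN-T>
```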

Root element children: Most root elements take two mandatory children: <head> and <body>, the latter containing data and the former, metadata (data about the data). TAN-TEI files take three children: <teiHeader>, <head>, and <text>, because the TEI header does not satisfy TAN expectations. See the section called “Transcriptions Using the Text Encoding Initiative (<TEI>)”.
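
The three-child arrangement of a TAN-TEI file, together with its switch of namespace at <head>, can be sketched as follows. The prolog is omitted and the @id value is hypothetical; the contents of each child are reduced to placeholder comments.

```xml
<!-- The TEI namespace is the default throughout the file -->
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     id="tag:example.com,2014:sample-tei-transcription"
     TAN-version="2018">
   <teiHeader>
      <!-- ordinary TEI header -->
   </teiHeader>
   <!-- /TEI/head alone takes the TAN namespace -->
   <head xmlns="tag:textalign.net,2015:ns">
      <!-- TAN metadata -->
   </head>
   <text>
      <!-- TEI-encoded transcription -->
   </text>
</TEI>
```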

All TAN files may take one final optional child, <tail>, a private use element that allows any well-formed XML. It was introduced to facilitate more efficient validation. Nothing in a TAN file should be dependent upon the <tail>. That is, if you are editing a TAN file and you add a <tail>, assume that it will be disregarded by other users. Similarly, you may delete any TAN file's <tail> without consequence.

@id and a TAN file's IRI Name

Every TAN file requires in its root element an @id. Its value, termed the TAN file's IRI name, must take the form of a tag URN (see the section called “Tag URNs” for syntax). The file's IRI name is the primary way other TAN files will refer to it.

The namespace of the current file's IRI name must match at least one namespace in one <person>'s <IRI> value. This helps tie the responsibility for the TAN file to at least one person. The first such <person> is called the primary agent, and is bound to the global variable $primary-agent.
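
The relationship between @id and a <person> can be sketched as below. Everything specific is hypothetical: the example.com tag URNs, the xml:id value, and the <name> child are assumptions made for illustration, and the rest of the <head> is omitted. What matters is that the <IRI> shares the namespace tag:example.com,2014 with the file's IRI name.

```xml
<TAN-T xmlns="tag:textalign.net,2015:ns"
       id="tag:example.com,2014:sample-transcription"
       TAN-version="2018">
   <head>
      <!-- other head content omitted -->
      <!-- This IRI begins with tag:example.com,2014, the same namespace as @id above,
           so this person would count as the primary agent -->
      <person xml:id="jdoe">
         <IRI>tag:example.com,2014:self</IRI>
         <name>Jane Doe</name>
      </person>
   </head>
   <body>
      <!-- data -->
   </body>
</TAN-T>
```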

In choosing a value for @id you might borrow the filename, but you do not have to. Indeed, it is probably not a good idea, since files are frequently renamed, often with good reason. A TAN file's IRI name should not be changed, especially after publication, because the name is supposed to be permanent and stable.

On occasion during editing, it will become clear that revisions are so deep that the file is altogether a different kind of thing. If a previous version has been published, then coining a new IRI name is advised, to dissociate the file from its ancestry. You may always document the connection by supplying a <see-also> element in the <head>, specifying the <relationship> between the two.

If you take someone else's data and alter it, you should not change the IRI name, not even its namespace. To avoid suggesting that the owner of that namespace is responsible for your revisions (assuming you are allowed to make them; see <license>), you should add yourself as a <person> and then document your alterations through <change> or @ed-when and @ed-who. You should also probably add a <see-also> element, pointing to a version of the file that predates your intervention.

The name of the version of a TAN file is identified by the most recent date in a file's @when, @ed-when, or @when-accessed. It is important, therefore, whenever you change a TAN file that has already been published, to provide at least an edit stamp (the section called “Edit Stamp”) in the part of the file you changed or in a <comment> or <change>, so that anyone validating a TAN file dependent upon yours will be warned that changes have been made. The user may then either continue to process the file (the changes may be minor or inconsequential) or investigate the changes before deciding what to do.
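
A rough sketch of dating a revision follows. The placement of attributes is an assumption: it supposes that <change> carries @when and @who, that an edited element such as a <div> may carry @ed-when and @ed-who, and that @who and @ed-who point to a <person> by its xml:id. All values are hypothetical, and the section called “Edit Stamp” remains the authority on the actual rules.

```xml
<TAN-T xmlns="tag:textalign.net,2015:ns"
       id="tag:example.com,2014:sample-transcription"
       TAN-version="2018">
   <head>
      <!-- other head content omitted -->
      <person xml:id="jdoe">
         <IRI>tag:example.com,2014:self</IRI>
      </person>
      <!-- Assumed form of a change record: its date becomes the newest in the file -->
      <change when="2018-06-01" who="jdoe">Corrected transcription errors in the second division.</change>
   </head>
   <body>
      <!-- Assumed form of an edit stamp placed on the element that was changed -->
      <div n="2" ed-when="2018-06-01" ed-who="jdoe">Corrected text of the second division.</div>
   </body>
</TAN-T>
```

Under these assumptions the version of the file would be named by 2018-06-01, the most recent date it contains, and dependent files validated afterward would be alerted to the change.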

Because the IRI name is stable, it is suitable for use outside of TAN, in, for example, RDFa, JSON-LD, and linked open data (see the section called “Identifiers and Their Use”).

The IRI name kept at @id is the only metadatum positioned outside <head>. It is placed as rootward in the document as possible to emphasize that it names the entire document.

@TAN-version must be 2018, indicating that the files have been made in light of the development files of version one.