Metadata (<head>)

No matter how much one TAN format differs from another, the metadata are quite similar. Anyone getting a TAN file, no matter its class or type, is assumed to want to know, and therefore find easily and predictably, the following:

  1. the stable name of the file;

  2. its version;

  3. its sources;

  4. other files upon which it depends or with which it otherwise has an important relationship;

  5. the most significant parts of the editorial history;

  6. the linguistic or scholarly conventions that have been adopted in creating and editing the data;

  7. the license, i.e., who holds what rights to the data, and what kind of reuse is allowed;

  8. the persons, organizations, or entities that helped create the data, and the roles played by each.

To supply this information completely and consistently, the <head>, a mandatory child of the root element, follows a common pattern across all TAN formats, so that anyone can work predictably across large numbers and types of TAN files. The TAN <head>, intended to be concise and focused, compels you to provide metadata for the data that is governed by <body>, but it does not accommodate metadata for the metadata. That is, your metadata should focus on the data itself and not on other things. For example, <head> requires that you name the people who helped create or edit the data, but you are not expected to tell us about them. Merely give good <IRI>s that point to authoritative sources that provide background information.

Note

The principles above explain why the TEI extension of TAN requires two heads, one for TEI and the other for TAN. A <teiHeader> cannot be mapped onto a TAN <head>. But a <teiHeader> carries valuable, sometimes critically important, information, and should be retained, or replaced with a valid but empty skeleton.

Detailed descriptions of <head> and its components are in Chapter 8, TAN patterns, elements, and attributes defined. Here we provide a summary, general description of TAN metadata.

To describe the current file, <head> takes one or more <name>s, zero or more <desc>s and <master-location>s, and one <license>.

Next comes a list of files upon which the file depends: zero or more <inclusion>s, zero or more <key>s, zero or more <source>s, and zero or more <see-also>s.

All editorial assumptions are placed in <definitions>, whose contents differ from one TAN format to the next.

Finally comes the responsibility section stating who did what when: one or more <person>s, <role>s, and <change>s, and zero or more <resp>s.
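
The summary above can be read as a table of minimum and maximum counts. The following Python sketch checks a <head> against those counts only; it is an illustration, not the TAN validation routine. It assumes un-namespaced element names (real TAN files are namespaced), it ignores the order of children, and the count for <definitions> (exactly one) is an assumption, since the text does not state it. The names HEAD_CARDINALITY, check_head, and SAMPLE are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# (minimum, maximum) occurrences as summarized in the text; None = no upper limit.
# ASSUMPTION: exactly one <definitions>; the text does not give its count.
HEAD_CARDINALITY = {
    "name": (1, None), "desc": (0, None), "master-location": (0, None),
    "license": (1, 1),
    "inclusion": (0, None), "key": (0, None),
    "source": (0, None), "see-also": (0, None),
    "definitions": (1, 1),
    "person": (1, None), "role": (1, None), "change": (1, None),
    "resp": (0, None),
}


def check_head(head):
    """Return a list of messages for children whose count is out of range."""
    counts = Counter(child.tag for child in head)
    problems = []
    for tag, (low, high) in HEAD_CARDINALITY.items():
        found = counts.get(tag, 0)
        if found < low or (high is not None and found > high):
            problems.append(f"<{tag}>: found {found}")
    return problems


# Empty placeholder elements; a real <head> carries content and attributes.
SAMPLE = """<head>
  <name/><license/><source/><definitions/>
  <person/><role/><change/>
</head>"""

print(check_head(ET.fromstring(SAMPLE)))
```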

Two TAN elements cover rights and licenses: <license> (mandatory in every TAN file) and <licensor>. The first element defines the license under which you are releasing your data; the second specifies who has licensed the data.

The license applies only to the file itself, not to its sources. The distinction is important, and helpful. It is much easier for you to decide and state the rights and license behind your own work than to do so for that of others. Declaring who holds what rights over your source(s) may be not only difficult but risky, and is therefore optional (see below).

When using a TAN file, you should investigate the entire chain of rights. If you find a discrepancy between the license of a TAN file and that of its sources, you should respect the more restrictive one. If a TAN file has a very liberal, open license for the data, this does not necessarily mean that the material upon which it depends is in the public domain. The TAN file's source may be under tight restrictions.

If you wish to indicate what license governs a source, use <desc> in <source>.

TAN adopts the Creative Commons licenses as its default key vocabulary. See the section called “TAN keywords for types of rights (<license>)”.

Many if not most TAN files are created alongside or in the context of a project, where certain elements will be repeated. Explicit repetition from one file to the next makes them prone to error. Changes might be made in one file but not in another. TAN has two features—keys and inclusions—that help avoid duplication, reduce the likelihood of incomplete editing, and lead to cleaner, smaller files.

In general, you should first work with keys. If they are not doing the job you need, then try inclusions.

Most often, an editor wants a simple, shorthand reference to an entity commonly referred to from one file to the next in a single project, e.g., the person who is the principal editor, roles, and division types.

Projects are advised to create their own <TAN-key> files populated with commonly used vocabulary.

Using those files is a two-step process. First, the TAN-key file is declared via <key>. Second, elements (normally in <definitions>) can take @which instead of the customary IRI + name pattern. @which points to a <name> in the TAN-key file.
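
The two steps can be pictured with a small Python sketch. It is a toy model under stated assumptions, not TAN's own processing: PROJECT_KEY is a hypothetical in-memory stand-in for the vocabulary of a TAN-key file, the example.org IRIs are invented, and exact string matching of @which against names is an assumption.

```python
# Hypothetical stand-in for the vocabulary of a project's TAN-key file:
# each item pairs IRIs with the names that @which may cite.
PROJECT_KEY = [
    {"iris": ["http://example.org/people/jdoe"], "names": ["Jane Doe", "jdoe"]},
    {"iris": ["http://example.org/roles/editor"], "names": ["editor"]},
]


def resolve_which(which, declared_keys):
    """Step 2: find the item whose names include the value of @which.

    `declared_keys` models step 1, the key files declared via <key>.
    ASSUMPTION: names are compared as exact strings.
    """
    for key_file in declared_keys:
        for item in key_file:
            if which in item["names"]:
                return item
    raise LookupError(f"no declared key defines the name {which!r}")


# Step 1: the file declares the project key; step 2: @which is resolved.
declared = [PROJECT_KEY]
print(resolve_which("editor", declared)["iris"])
```

An element that cites a name no declared key defines raises LookupError in this sketch, which mirrors the point that @which depends on the keys being declared first.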

TAN includes a number of standard TAN-key files located at http://textalign.net/release/TAN-2018/TAN-key/ and documented in Chapter 9, Official TAN keywords. Any element that takes @which can take full advantage of those files, without <key>.

It is strongly recommended that you depend upon only TAN-key files you have written, and not those of a different project.

More powerful than TAN-keys are inclusions. Unlike other forms of inclusion you may be familiar with, TAN inclusion involves only select elements, never an entire file. As with keys, TAN inclusion is a two-step process.

First, a TAN file is made available for inclusion via <inclusion>s (inside <head>). Like <key>, an <inclusion> does nothing on its own. It merely indicates a file that may be used for inclusions.

Second, elements that allow it may take @include, which points to the @xml:id reference of the <inclusion>. In the validation process, those elements will be replaced with every element of that name found in the inclusion file, checked recursively (see below), and ignoring duplicated elements.

<inclusion>s are critically important to the content of the TAN file, so any file with <inclusion>s that cannot be located will be regarded as being in fatal error. Because of the importance of access to included files, it is strongly recommended that inclusions be limited to files locally available, in the same project.

Inclusions are recursive. If a TAN file A has <x include='B'> and file B has <x include='C D E'> then file A will be given all <x>s found in B, C, D, and E.

In any recursive activity, circularity is fatal. That is true for TAN inclusion as well, but only within a given element name. It is perfectly legal for two files to include each other, as long as they do not try to include the same elements.
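
The recursion, the removal of duplicates, and the per-element circularity rule can be sketched in Python as follows. This is a simplified model, not the TAN validation code: files are plain dictionaries, @include points directly at a file id rather than at the @xml:id of an <inclusion>, and the names FILES and resolve are hypothetical.

```python
# Toy model: file id -> element name -> list of entries.
# ("value", v) is an element stated in the file itself;
# ("include", [ids]) stands for an element carrying @include.
FILES = {
    "A": {"x": [("include", ["B"])]},
    "B": {"x": [("value", "b1"), ("include", ["C", "D", "E"])]},
    "C": {"x": [("value", "c1")]},
    "D": {"x": [("value", "d1"), ("value", "c1")]},  # c1 duplicates C
    "E": {"x": [("value", "e1")]},
    # P and Q include each other, but for different element names.
    "P": {"x": [("include", ["Q"])], "y": [("value", "p-y")]},
    "Q": {"x": [("value", "q-x")], "y": [("include", ["P"])]},
    # R and S include each other for the same element name.
    "R": {"x": [("include", ["S"])]},
    "S": {"x": [("include", ["R"])]},
}


def resolve(files, file_id, element, active=()):
    """Return the values of `element` in `file_id` with inclusions expanded."""
    if file_id not in files:
        raise LookupError(f"included file {file_id!r} cannot be located")
    step = (file_id, element)
    if step in active:  # the same element name is being resolved in a loop
        raise ValueError(f"circular inclusion of <{element}> at {file_id!r}")
    resolved = []
    for kind, payload in files[file_id].get(element, []):
        if kind == "value":
            found = [payload]
        else:
            found = [value for target in payload
                     for value in resolve(files, target, element, active + (step,))]
        for value in found:
            if value not in resolved:  # ignore duplicated elements
                resolved.append(value)
    return resolved


print(resolve(FILES, "A", "x"))  # values gathered from B, C, D, and E
print(resolve(FILES, "P", "x"))  # legal: P and Q share no element name
```

Calling resolve(FILES, "R", "x") raises ValueError, because R and S try to include the same element from each other.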

TAN inclusion removes elements from their original context, which means that values that must be interpreted locally are converted before the elements are included. For example, @which must be interpreted in light of the included document's keys, not those of the including document. Similarly, different numeration systems, e.g., Roman numerals, must be interpreted locally and converted, before inclusion (see the section called “One reference system”).

Creating and editing a class 1 TAN file frequently involves working with non-TAN digital files. In the course of editing, and making the material TAN-compatible, you will likely start to correct errors, to normalize conventions, or to bring the transcription closer to an earlier version. At such times it may be unclear how to credit the digital files.

To answer this, first determine a class 1 file's <source>. Everything else is then a <see-also>.

If you find in the course of editing that you are starting to depend upon the source of your source, then that earlier version should be credited as the <source> and the file you were using should be moved to <see-also>.

Many attributes are not inheritable, e.g., @xml:id. Others are inheritable, indicating something about the host element and all its descendants. When a descendant has the same attribute, the default behavior is for the new attribute to cancel any inherited ones, e.g., @xml:lang, @affects-element, @claimant. In other cases, the inherited effect is additive, e.g., @cert. Consult individual attribute entries to understand an attribute's behavior.

Some attributes in an element have priority for interpretation. @claimant, for example, has priority over @cert. That is, the two attributes in the same element are to be interpreted to mean: "@claimant has @cert confidence about the following claim:...."

At the heart of interaction between class 1 and class 2 files is a reference system that counts or names words. This poses a problem at the outset. The term word is notoriously difficult to define, no matter the language. In different contexts, for example, "New York" and "didn't" can each be justifiably taken to be one or two words. Furthermore, some scholars consider punctuation to be words (e.g., commas in modern prose, representing "and"), whereas others ignore them as being anachronistic or capricious (e.g., ancient Greek and Latin). In the end, the number of meanings for "word" reflects the rich variety of scholarly disciplines.

TAN adopts the proximate term token—a word that is defined not according to grammar but according to a regular expression (see the section called “Regular Expressions”).

A TAN token is a reference pointer, not a linguistic marker. To define a token in TAN does not entail any linguistic commitments. Neither editors nor users of TAN data should infer that a <tok> points to a morpheme, a lexeme, or any other linguistic entity. There will frequently be a fortuitous correlation between the two, but it is not guaranteed. In TAN, a token is purely a method of reference.

TAN was developed in service of ancient literature, where punctuation is generally ignored as being late or not central to the text. Even in contemporary use, most people ignore punctuation when they count words. Therefore the default <token-definition> defines a token as being any continuous string of word characters, the soft hyphen, the zero-width space, or the zero-width joiner, formally defined:

<token-definition regex="[\w&#xad;&#x200b;&#x200d;]+"/>

This pattern will result in a close resemblance to what is ordinarily thought of as words, but perhaps with some surprises (see above, the section called “Regular Expressions”). If no <token-definition> is explicitly given, the pattern above will be assumed.
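
To get a feel for the default pattern, here is a minimal Python approximation. It is only a sketch: Python's re module does not define \w exactly as the XML regular expression dialect does (combining marks and some symbols are treated differently), so results on real texts may differ from those of a TAN processor. The names DEFAULT_TOKEN and tokenize are hypothetical.

```python
import re

# Python rendering of the default pattern: word characters, the soft hyphen
# (U+00AD), the zero-width space (U+200B), or the zero-width joiner (U+200D).
# NOTE: Python's \w is not identical to \w in the XML regex dialect.
DEFAULT_TOKEN = re.compile(r"[\w\u00ad\u200b\u200d]+")


def tokenize(text):
    """Return the tokens the (approximated) default definition finds."""
    return DEFAULT_TOKEN.findall(text)


print(tokenize("New York didn't sleep."))
# ['New', 'York', 'didn', 't', 'sleep']
```

The apostrophe is not a word character, so "didn't" becomes two tokens, one of the surprises mentioned above. A soft hyphen inside a word, by contrast, does not split it.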

If you are working with modern texts, where punctuation might be important to name and number, try the built-in keyword letters and punctuation:

<token-definition regex="\w+|[^\w\s]"/>

This expression defines a token as a sequence of word characters or any single character that is neither a word character nor whitespace. The string "(I go!)" (the text inside the quotation marks) would have five tokens: ( I go ! ).

Above are two built-in, TAN-defined <token-definition>s. You may customize your own <token-definition> to suit your needs. But keep in mind that TAN files were meant to be shared across fields and disciplines. You are encouraged to define tokens in a manner customary to users of the text. Specialized definitions make it less likely that your TAN file will be able to mesh well with other TAN files. Two class-2 files annotating the same class-1 file cannot be easily compared or synthesized if they use different definitions of token.
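
The five-token result can be reproduced with a short Python sketch. It approximates the TAN pattern with Python's re module, whose \w and \s are close to, but not the same as, those of the XML regex dialect. Numbering the tokens from 1 is an assumption made for illustration, and the names LETTERS_AND_PUNCTUATION and numbered_tokens are hypothetical.

```python
import re

# A run of word characters, or one character that is neither
# a word character nor whitespace.
LETTERS_AND_PUNCTUATION = re.compile(r"\w+|[^\w\s]")


def numbered_tokens(text):
    """Pair each token with its position (ASSUMPTION: counting from 1)."""
    return list(enumerate(LETTERS_AND_PUNCTUATION.findall(text), start=1))


print(numbered_tokens("(I go!)"))
# [(1, '('), (2, 'I'), (3, 'go'), (4, '!'), (5, ')')]
```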

Given those caveats, consider a specialized case, where you wish to prepare your transcriptions such that certain Unicode characters precisely delimit tokens that are synonymous with a particular linguistic category, say lexeme. Say, for example, you use specialized control characters (e.g., U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER) to mark word boundaries within the text of your class 1 file. You might then create a <token-definition> like this:

<token-definition regex="[^\p{Cf}\s]+"/>

This statement defines a token as any consecutive sequence of characters that are neither whitespace nor Unicode format (Cf) control characters.

Such customized approaches may make the technique unwieldy or impossible to use, thereby limiting your TAN file's interoperability and utility. If you use format control characters or other invisible special characters, it is recommended that you write them as XML character references, e.g., &#x200D;, so that they can be seen in your file.
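
Python's standard re module does not support \p{Cf}, so the sketch below emulates the pattern with unicodedata instead of a regular expression. It is an approximation under stated assumptions: str.isspace() covers more characters than \s does in the XML regex dialect, and the names is_delimiter and tokenize are hypothetical.

```python
import unicodedata
from itertools import groupby


def is_delimiter(ch):
    """True for whitespace or a Unicode format character (category Cf)."""
    return ch.isspace() or unicodedata.category(ch) == "Cf"


def tokenize(text):
    """Split text into maximal runs of non-delimiter characters."""
    return ["".join(run)
            for delimiter, run in groupby(text, key=is_delimiter)
            if not delimiter]


# U+200C (ZERO WIDTH NON-JOINER) marks boundaries inside the first word.
print(tokenize("un\u200cbreak\u200cable words!"))
# ['un', 'break', 'able', 'words!']
```

Punctuation stays attached to its neighbours under this definition, as "words!" shows, which is one way such a specialized definition departs from the two built-in ones.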