Metadata (<head>)

No matter how much one TAN format differs from another, the metadata follows the same basic structure. Anyone getting a TAN file, no matter its class or type, is assumed to want to know, and therefore to find easily and predictably, the following:

the stable name of the file;
its version;
its sources;
other files upon which it depends or otherwise has an important relationship;
the most significant parts of the editorial history;
the linguistic or scholarly conventions that have been adopted in creating and editing the data;
the license, i.e., who holds what rights to the data, and what kind of reuse is allowed;
the persons, organizations, or entities that helped create the data, and the roles played by each.

To answer these questions completely, consistently, and predictably, <head>, a mandatory child of the root element, follows a common pattern across all TAN formats. The TAN <head> is intended to be concise and focused: it compels you to provide metadata for the data governed by <body>, but it does not accommodate metadata about the metadata. For example, <head> requires you to name the people who helped create or edit the data, but you are not expected to describe them. Merely supply good <IRI>s that point to authoritative sources of background information.^[12]

In what follows we provide a general overview of the TAN <head>, focusing on its general structure, and some of the principles that affect other parts of the TAN ecosystem.

Key Information

Key information about the file as a whole forms the first section of a <head>. It includes <name>, optionally one or more <desc>s, and optionally one or more <master-location>s, which point to locations of authoritative versions. <master-location> is optional unless <to-do> (see below) is empty, in which case it is required.

Key Declarations

The second section of each <head> is a set of declarations that govern how the file should be used: <license> and <numerals>.

<license> stipulates the license(s) under which the persons or organizations listed in its @licensor are releasing the data. The license applies only to the data in <body>, not to its sources. The distinction is important, and helpful. It is much easier for you to decide and state the rights and license behind your own work than to speak for others. Declaring who holds what rights over your source(s) may be not only difficult but risky, and is therefore optional, best handled in a <desc> or <comment>.

When using a TAN file, you should investigate the entire chain of rights. You may find discrepancies between the license of a TAN file and that of its sources. For example, you might create a complete TAN-based lexico-morphological analysis of a 20th-century novel, and legitimately release the TAN data under a public domain license, even though the novel itself is under copyright. Users must be aware of and respect licenses, and know that the license in a TAN file may not be the license of its sources.

TAN adopts the Creative Commons licenses as its default license vocabulary. See the section called “TAN keywords for types of rights (<license>)”.

<numerals> may be used to declare whether an ambiguous numeral should be interpreted as an alphabetic numeral or a Roman numeral (default). See the entry for <numerals> as well as the section on numeration systems.
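The kind of ambiguity that <numerals> settles can be illustrated with a short Python sketch. It is not TAN's own numeral parser: the function names, the `prefer` parameter, and the alphabetic scheme (single letters only, a = 1 through z = 26) are assumptions made for the illustration, and the Roman numeral reader does not check that a numeral is well formed.

```python
ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

def roman_to_int(s):
    """Read a Roman numeral using the subtractive rule (no well-formedness check)."""
    values = [ROMAN[ch] for ch in s.lower()]
    total = 0
    for i, v in enumerate(values):
        # A smaller value placed before a larger one is subtracted (iv = 4).
        if i + 1 < len(values) and v < values[i + 1]:
            total -= v
        else:
            total += v
    return total

def alphabetic_to_int(s):
    """Assumed scheme for the sketch: one letter, a = 1 ... z = 26."""
    return ord(s.lower()) - ord("a") + 1

def read_numeral(s, prefer="roman"):
    """Interpret s, using `prefer` (hypothetical name) only when s is ambiguous."""
    s = s.lower()
    is_roman = s != "" and all(ch in ROMAN for ch in s)
    is_alpha = len(s) == 1 and "a" <= s <= "z"
    if is_roman and (prefer == "roman" or not is_alpha):
        return roman_to_int(s)
    if is_alpha:
        return alphabetic_to_int(s)
    raise ValueError(f"not a numeral this sketch understands: {s!r}")

print(read_numeral("c"))                       # 100
print(read_numeral("c", prefer="alphabetic"))  # 3
print(read_numeral("xiv"))                     # 14
```

Only strings such as "c", "i", or "v" are affected by the preference; "xiv" can only be Roman and "b" can only be alphabetic.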

Many TAN files allow in this section <token-definition>, which specifies a definition for tokens, perhaps tailored via @src to a specific class-2 file. See the section called “Defining words and tokens” and <token-definition>.

Networked Files

The third major section of <head> accommodates links and references to other files. Some files are essential to processing the TAN file, while others are less important.

The two most critical types of files are marked by <inclusion> and <vocabulary>. The files pointed to by these elements should be considered constituent parts of the dependent TAN file. In the validation process, failure to access any one of them (calculated recursively) is a fatal error.

<inclusion> and <vocabulary> were developed to reduce duplication (and therefore potential error) in collections of TAN files. Many if not most TAN files are created alongside or in the context of a project, where certain data patterns are repeated. Explicit repetition from one file to the next makes them prone to error. Changes might be made in one file but not in another, introducing version conflicts. <inclusion> and <vocabulary> provide a specialized method of inclusion that leads to cleaner, smaller files.

In general, you should first try using <vocabulary>, which points to TAN-voc files that collect vocabulary items common to the project. If that element does not do what you want, then try <inclusion>. It is normally easier to diagnose a complex set of <vocabulary>s than a complex set of <inclusion>s.

Vocabularies

Oftentimes, from one file to the next, an editor needs to refer repeatedly to a common set of things, e.g., manuscripts, works of literature, or persons who helped edit the files.

Projects are advised to create their own <TAN-voc> files, populated with commonly used vocabulary. Once set up, the TAN-voc file must be linked to via a <vocabulary> in the <head> of each TAN file that draws from the vocabulary. Vocabulary items can then be invoked either by pointing to <name> values, or by assigning an @xml:id to a vocabulary item placed in the <head>'s <vocabulary-key>. If you draw upon <name>, you may make alterations to capitalization. Hyphens, spaces, and underscores are treated as interchangeable. Capitalization and spelling of @xml:id, however, must be strictly followed.

Vocabulary (TAN-voc) files tend to require frequent change and expansion, so it is recommended that you depend upon only those TAN-voc files that are part of your project, and not those from a different project.

In the host file, any attribute that takes multiple IDrefs, e.g., @who, @type, @subject, may take a mixture of values that refer to numerous vocabulary items via @xml:id or <name>. But in these attributes spaces are reserved to delimit multiple values, which means that if you refer to a <name>, spaces must be replaced with the underscore or hyphen. A @which in the host file, however, can take no more than one value, so using spaces is fine.

To summarize: @id and @xml:id are case-sensitive and do not allow spaces, so when you point to one you must respect case and punctuation. @which, and therefore <name>, is not case-sensitive, and the space, hyphen, and underscore are equivalent. When you point to a <name> from an attribute that takes multiple values, replace each space with an underscore or hyphen.
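The matching rules above can be sketched in Python. This is a minimal illustration, not TAN's validation code: the function names `normalize_name` and `resolve_refs`, the sample vocabulary, and the choice to try ids before names are all assumptions.

```python
def normalize_name(name):
    """Fold case and treat space, hyphen, and underscore as one character."""
    return name.lower().replace(" ", "_").replace("-", "_")

def resolve_refs(attr_value, ids, names):
    """Resolve a space-delimited attribute value (e.g. @who) to vocabulary items.

    ids   : dict mapping an id to an item; matched exactly, case included
    names : dict mapping a <name> value to an item; matched loosely
    """
    by_name = {normalize_name(n): item for n, item in names.items()}
    resolved = []
    for token in attr_value.split():       # spaces delimit multiple values
        if token in ids:                   # assumed precedence: ids first
            resolved.append(ids[token])
        elif normalize_name(token) in by_name:
            resolved.append(by_name[normalize_name(token)])
        else:
            raise KeyError(f"no vocabulary item matches {token!r}")
    return resolved

# Hypothetical vocabulary for the illustration.
ids = {"ed1": "person: first editor"}
names = {"Sermon on the Mount": "work: Sermon on the Mount"}

print(resolve_refs("ed1 sermon-on_the_MOUNT", ids, names))
# ['person: first editor', 'work: Sermon on the Mount']
```

Because ids are matched exactly, a value such as `ED1` would not resolve to the item whose id is `ed1`.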

TAN includes a number of standard vocabulary (TAN-voc) files for a variety of concepts commonly used in textual scholarship (see Chapter 11, Official TAN vocabularies). Vocabulary items have been defined for more than one hundred types of textual divisions, and any of these can be invoked simply by using their names (see the section called “TAN keywords for types of divisions (<div-type>)”).

<vocabulary> itself may take @which, but only to point to one of the extra TAN vocabularies listed in the section called “TAN vocabulary items for extra vocabularies (<vocabulary>)”. You cannot point to a customized TAN-voc file via @which. This restriction avoids some complexity in the validation routine. See the section called “Extra @n vocabulary” on how to use this feature.

Files pointed to by <vocabulary> are considered an essential part of any TAN file. Failure to find the target file will throw a fatal error during validation.

Inclusions

Whereas vocabularies do not change the host document, inclusions do. Unlike other forms of inclusion you might be familiar with, TAN inclusion is targeted at select elements, never an entire file. TAN inclusion is a two-step process.

First, a TAN file is linked to, and therefore made available for inclusion, via <inclusion>s (inside <head>). Like <vocabulary>, an <inclusion> does nothing on its own. It merely points to a file that is eligible for inclusions. No actual inclusions occur until the next step.

Second, select parts of the included file are invoked in the dependent file. To do so, insert an element X in a valid location, giving it nothing but an @include with one or more space-delimited values, each pointing to the @xml:id value of an <inclusion>. In the validation process, that element X will be replaced with all the element Xs found in the inclusion file, resolved recursively and ignoring duplicates (deeply equal elements).

For example, a TAN-T file might have a <div include="poem1">. The validation routine will replace that element with every rootmost <div> in the file pointed to by the <inclusion> whose @xml:id is poem1.
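The two-step process can be modeled in a few lines of Python. The sketch below is an illustration under simplifying assumptions, not the TAN validation routine: files are held in memory without namespaces, the `links` dictionary stands in for each file's <inclusion> elements, and the names `expand`, `topmost`, and `deep_key` are hypothetical. Deep equality is approximated by comparing names, attributes, trimmed text, and children.

```python
import xml.etree.ElementTree as ET

def deep_key(el):
    """Hashable fingerprint used to spot deeply equal elements."""
    return (el.tag, tuple(sorted(el.attrib.items())), (el.text or "").strip(),
            tuple(deep_key(child) for child in el))

def topmost(root, tag):
    """Elements named `tag` that have no ancestor named `tag`."""
    found = []
    for child in root:
        if child.tag == tag:
            found.append(child)
        else:
            found.extend(topmost(child, tag))
    return found

def expand(file_id, tag, files, links, active=frozenset()):
    """Topmost `tag` elements of a file, with @include placeholders replaced."""
    if (file_id, tag) in active:           # circularity is scoped to the element name
        raise ValueError(f"circular inclusion of <{tag}> via {file_id}")
    active = active | {(file_id, tag)}
    result = []
    for el in topmost(files[file_id], tag):
        if el.get("include") is None:
            result.append(el)
            continue
        seen = set()
        for ref in el.get("include").split():
            target = links[file_id][ref]   # id of an <inclusion> -> target file
            for inc in expand(target, tag, files, links, active):
                if deep_key(inc) not in seen:   # ignore duplicates
                    seen.add(deep_key(inc))
                    result.append(inc)
    return result

files = {
    "host": ET.fromstring('<body><div include="poem1"/></body>'),
    "poem": ET.fromstring('<body><div n="1">first</div><div n="2">second</div>'
                          '<div n="1">first</div></body>'),
}
links = {"host": {"poem1": "poem"}, "poem": {}}

divs = expand("host", "div", files, links)
print([d.get("n") for d in divs])  # ['1', '2']
```

The placeholder in the host file is replaced by two elements rather than three, because the repeated <div> in the included file is deeply equal to the first.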

Any host file that includes elements from another file inherits any vocabulary associated with the inclusion, and along with it @xml:id values. This may result in IDrefs pointing to two or more distinct vocabulary items, which may be a benefit or a hindrance. Be familiar with the items you are including.

TAN inclusion is very practical for texts. Textual works commonly nest inside each other. By setting up your class-1 files as a series of inclusions, you can reduce validation time, both in the file and in class-2 files that depend upon the transcriptions. See the examples subdirectory for a sample of a Gospel of Matthew including the Sermon on the Mount including the Lord's Prayer.

The inclusion technique is also especially useful for vocabulary (TAN-voc) files. A single master TAN-voc file can include other vocabulary files, each devoted to a particular type of item (e.g., one for works, one for scripta). Project files then need to link merely to the master TAN-voc file.

You can include a TAN file that itself includes other TAN files. Inclusion is recursive. In any recursive system, circularity is fatal. That is true for TAN inclusion as well, but only within the scope of specified element names. It is perfectly legal for two files to include each other, as long as they do not try to include (directly or indirectly) the same elements, or try to consult each other to resolve any vocabulary.

Files pointed to by <inclusion> are considered an essential part of any TAN file. Failure to find the target file will throw a fatal error during validation.

Other related files

A TAN file may point to a number of other types of files. The more that are mentioned, the richer the network. <predecessor> and <successor> point to versions of the file that precede and postdate it.

<source> is another type of related file, but it may or may not link to another file. In class-2 files <source> always points to a class-1 TAN file. In class-1 and class-3 files, <source> may point either to a file or to a scriptum (see the section called “Domain model”).

<see-also> can be used to point to any file that has some relationship to a TAN file. The required @relationship points to one or more <relationship> vocabulary items. There is no standard TAN vocabulary for relationships. Normally, when a file-to-file relationship is considered important, it becomes a full-fledged standard TAN element.

Some TAN formats allow special types of related files (e.g., <redivision> and <model> for class-1 files). See metadata descriptions under specific classes or formats.

Adjustments

The fourth major section of <head>, which is optional, consists of <adjustments>, which specifies changes that have been made (class 1), or should be made (class 2), to the sources.

In class-1 files, these consist of <normalization>s and <replace>s; see the section called “Normalizing transcriptions”.

Class-2 files allow <skip>, <rename>, <equate>, and <reassign> as adjustments; see the section called “Class 2 metadata (<head>)”.

Local vocabulary items and ID assignments: `<vocabulary-key>`

The fifth major part of <head>, <vocabulary-key>, allows you to declare any vocabulary items specific to the file. It also allows you to take vocabulary items existing in other TAN-voc files (whether defined in <vocabulary> or standard TAN vocabulary), and assign them @xml:ids that are valid only in the current file. Anything in <vocabulary-key>, and any TAN-voc files pointed to via <vocabulary>, will overwrite default TAN vocabulary.

These id assignments can be supplemented with <alias>es, each of which assigns a single id to a group of one or more ids. This practice resembles what text editors do when naming groups of manuscripts. Each manuscript is given a siglum, say a single lowercase Greek or Latin letter, and the manuscripts are grouped into families, with each family given its own siglum, say an uppercase letter. If the editor wishes to indicate that a whole family of manuscripts departs from a particular reading, the family siglum is all that is needed. An <alias> works much the same way, and can be used for any vocabulary items. For example, if a textual division can legitimately be called both a rubric and a heading, you could assign rubr and hd as ids in the <vocabulary-key> to the vocabulary items for the rubric and the heading, and then insert <alias xml:id="rubrichead" idrefs="rubr hd">. Then, in that file, <div n="1" type="rubrichead"> would identify that <div> as being both a rubric and a heading.

Unlike other pointing attributes, the @idrefs of an <alias> cannot point to the <name> value of vocabulary items. They can refer only to the id values of locally defined instances of @xml:id. This restriction reduces confusion, and avoids some complexity in the resolution and validation of a TAN file.

<alias>es may recurse, as long as there is no circularity. That is, @idrefs in an <alias> may refer to any @xml:id or @id, not only to a vocabulary item but to another <alias>.
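Recursive alias resolution, with a guard against circularity, might look like the following Python sketch. The function name `resolve_alias` and the ids `titles` and `colophon` are invented for the illustration; `rubrichead`, `rubr`, and `hd` come from the example above. Each dictionary entry stands for one <alias>, keyed by its id, with the @idrefs value as a space-delimited string.

```python
def resolve_alias(ref, aliases, _path=()):
    """Expand an id to the set of non-alias ids it ultimately stands for."""
    if ref in _path:
        raise ValueError("circular alias: " + " -> ".join(_path + (ref,)))
    if ref not in aliases:          # an ordinary id, not an alias
        return {ref}
    resolved = set()
    for target in aliases[ref].split():
        resolved |= resolve_alias(target, aliases, _path + (ref,))
    return resolved

aliases = {
    "rubrichead": "rubr hd",
    "titles": "rubrichead colophon",   # an alias that points to another alias
}

print(sorted(resolve_alias("rubrichead", aliases)))  # ['hd', 'rubr']
print(sorted(resolve_alias("titles", aliases)))      # ['colophon', 'hd', 'rubr']
```

Tracking the path of aliases already visited is what turns a circular definition into an error instead of an endless loop.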

In most cases <alias> should refer to items of the same type. In a few situations mixed groups do not pose a problem, for example mixing <person>s, <algorithm>s, and <organization>s. TAN validation will indicate whether mixed typology introduces errors.

Because @xml:id may not contain certain types of characters, such as common punctuation marks, and because <alias> must be able to coin unusual ids (especially for grammatical features), @id may be used instead of @xml:id in <alias>.

Responsibility

The sixth section of a <head> declares who is responsible for the file. It consists of a <file-resp> and one or more <resp>s. The persons, organizations, or algorithms pointed to in <file-resp> must include at least one who has a tag URN whose namespace matches the namespace in the tag URN of the root element's @id.

This requirement strengthens the effort to make sure that each TAN file is associated with the person or persons who are or were responsible for the file. <person>s so identified by <file-resp> are called primary agents, and are bound to the global variable $primary-agents. If a claim is made in a TAN file, and no @claimant is explicitly declared, it is assumed that the $primary-agents are making the claim.
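The namespace requirement can be pictured with a small Python check. It is a sketch only: it assumes that the namespace of a tag URN is everything up to the second colon, that is, `tag:` followed by the authority name and date described in RFC 4151. The function names and the sample identifiers are hypothetical.

```python
def tag_namespace(urn):
    """Return 'tag:' plus the tagging entity (authority name and date)."""
    scheme, entity, _specific = urn.split(":", 2)   # ValueError if too few parts
    if scheme != "tag" or "," not in entity:
        raise ValueError(f"not a tag URN: {urn!r}")
    return f"{scheme}:{entity}"

def has_primary_agent(file_id, agent_iris):
    """True if at least one agent IRI shares the namespace of the file's @id."""
    namespace = tag_namespace(file_id)
    for iri in agent_iris:
        try:
            if tag_namespace(iri) == namespace:
                return True
        except ValueError:          # agents may also carry non-tag IRIs
            continue
    return False

file_id = "tag:example.com,2014:tan-t:sample"
print(tag_namespace(file_id))  # tag:example.com,2014
print(has_primary_agent(file_id, ["http://example.org/agent/1",
                                  "tag:example.com,2014:self"]))  # True
print(has_primary_agent(file_id, ["tag:example.org,2014:self"]))  # False
```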

Change log

The change log, the seventh section of the <head>, consists of one or more <change>s, which provide a partial history of the file. The entire history is calculated from every attribute that has a date or dateTime value, and can be fetched via the function tan:get-doc-history() or the global variable $doc-history.

The change log is an effective way to communicate with those who use your files. In all likelihood, a user will download a local copy from the master location, and you may later change or update your master copy. Anyone depending upon a copy will be warned, during Schematron validation, of each <change> that postdates the value of their @accessed-when. If you have introduced an important or disruptive change, you can mark your <change> with @flag, which allows the following values: warning (the default), error, info, and fatal. Marking a change as info lowers its level of importance; error raises it. The value fatal halts the validation process in the dependent file altogether.

If you receive change messages during validation, and you want to stop them, merely update the value of @accessed-when to the current date.
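The reporting logic described in this section can be approximated in Python as follows. This is a sketch, not the Schematron routine: each <change> is reduced to a dictionary, and the keys `when` and `text`, like the function name `pending_changes`, are assumptions made for the illustration.

```python
from datetime import date

LEVELS = ("info", "warning", "error", "fatal")

def pending_changes(changes, accessed_when):
    """Return (flag, text) pairs for changes that postdate the local copy."""
    reports = []
    for change in changes:
        flag = change.get("flag", "warning")      # warning is the default
        if flag not in LEVELS:
            raise ValueError(f"unknown flag {flag!r}")
        if change["when"] > accessed_when:
            reports.append((flag, change["text"]))
    return reports

changes = [
    {"when": date(2020, 1, 5), "text": "Corrected typos"},
    {"when": date(2021, 3, 1), "text": "Renumbered all divisions", "flag": "fatal"},
    {"when": date(2021, 6, 9), "text": "Added a description", "flag": "info"},
]

reports = pending_changes(changes, accessed_when=date(2020, 12, 31))
print(reports)
# [('fatal', 'Renumbered all divisions'), ('info', 'Added a description')]

# A fatal change halts validation of the dependent file.
print(any(flag == "fatal" for flag, _ in reports))  # True

# Updating the access date silences the messages.
print(pending_changes(changes, accessed_when=date(2021, 6, 9)))  # []
```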

Pending work

The last section of a <head> lists the tasks that have yet to be completed in the file. These are itemized as a list of <comment>s in <to-do>. A file with an empty <to-do> is assumed to be no longer in progress, so it must provide a <master-location>.

Like the change log, the <to-do> effectively communicates cautionary notes to those who might use your files. Anyone depending upon a copy will be warned, during Schematron validation, of each item in the list. The report is not dependent upon when the file was last consulted (@accessed-when), because this is a collection of standing, unresolved issues.

One benefit of <to-do> is that you can release your material before it is finished. Other users will have fair warning about what is imperfect or incomplete.
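The rule that ties an empty <to-do> to <master-location> can be expressed as a short Python check. It is a sketch only: the sample heads are stripped of namespaces and of all other required content, the <master-location> is left empty for brevity, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def lacks_required_master_location(head):
    """True if <to-do> lists no <comment> yet no <master-location> is given."""
    todo = head.find("to-do")
    finished = todo is not None and todo.find("comment") is None
    return finished and head.find("master-location") is None

in_progress = ET.fromstring(
    "<head><to-do><comment>Proofread chapter 2</comment></to-do></head>")
finished_ok = ET.fromstring("<head><master-location/><to-do/></head>")
finished_bad = ET.fromstring("<head><to-do/></head>")

print(lacks_required_master_location(in_progress))   # False
print(lacks_required_master_location(finished_ok))   # False
print(lacks_required_master_location(finished_bad))  # True
```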

^[12] The principles above explain why the TEI extension of TAN requires two heads, one for TEI and the other for TAN. The <teiHeader> supports the creation of metadata that has little or no relevance to the content of <body>, has its own unique structure, requires very few metadata, and is not designed to incorporate IRIs. Although <teiHeader> and TAN's <head> overlap in some respects, they cannot be mapped onto each other. Each has a different purpose, so both must be retained.
