Metadata (<head>)

No matter how much one TAN format differs from another, the metadata are quite similar. Anyone getting a TAN file, no matter its class or type, is assumed to want to know, and therefore to find easily and predictably, the following:

the stable name of the file;
its version;
its sources;
other files upon which it depends or with which it otherwise has an important relationship;
the most significant parts of the editorial history;
the linguistic or scholarly conventions that have been adopted in creating and editing the data;
the license, i.e., who holds what rights to the data, and what kind of reuse is allowed;
the persons, organizations, or entities that helped create the data, and the roles played by each.

To answer these questions completely and consistently, the <head>, a mandatory child of the root element, follows a common pattern across all TAN formats, so that anyone can work easily and predictably across large numbers and types of TAN files. The TAN <head>, intended to be concise and focused, compels you to provide metadata for the data that is governed by <body>, but it does not accommodate metadata for the metadata. That is, your metadata should focus on the data itself and not other things. For example, <head> requires you to name the people who helped create or edit the data, but you are not expected to tell us about them. You merely refer through <IRI> to other authoritative sources that can provide background information.

	Note
	The principles above explain why the TEI extension of TAN requires two heads, one for TEI and the other for TAN. Because of its design principles, the `<teiHeader>` cannot be mapped onto a TAN `<head>`. But that `<teiHeader>` has valuable, sometimes critically important, information, and should be retained. Alternatively, it may be left empty.

Detailed descriptions of <head> and its components are in Chapter 8, TAN patterns, elements, and attributes defined. Here we provide a summary, general description of TAN metadata.

To describe the current file, <head> takes one or more <name>s, zero or more <desc>s and <master-location>s, and one <rights-excluding-sources>.

Next comes a list of files upon which the file depends: zero or more <inclusion>s, zero or more <key>s, zero or more <source>s, and zero or more <see-also>s.

All editorial assumptions are placed in <declarations>, whose contents differ from one TAN format to the next.

Finally comes the responsibility section stating who did what when: one or more <agent>s, <role>s, and <change>s, and zero or more <agentrole>s.
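The counts summarized above can be expressed as a small cardinality check. The following is only a sketch: it assumes a `<head>` without namespace or attributes, the function name `check_head` and the sample document are hypothetical, and `<declarations>` is not checked because the summary does not say how many are allowed. It is not a substitute for TAN validation.

```python
import xml.etree.ElementTree as ET

# (minimum, maximum) counts for children of <head>, taken from the summary
# above; None means "no upper limit". <declarations> is deliberately omitted
# because the summary does not state its cardinality.
CARDINALITY = {
    "name": (1, None),
    "desc": (0, None),
    "master-location": (0, None),
    "rights-excluding-sources": (1, 1),
    "inclusion": (0, None),
    "key": (0, None),
    "source": (0, None),
    "see-also": (0, None),
    "agent": (1, None),
    "role": (1, None),
    "change": (1, None),
    "agentrole": (0, None),
}


def check_head(head):
    """Return a list of cardinality problems (empty when none are found)."""
    problems = []
    for tag, (low, high) in CARDINALITY.items():
        count = sum(1 for child in head if child.tag == tag)
        if count < low:
            problems.append(f"<{tag}>: expected at least {low}, found {count}")
        elif high is not None and count > high:
            problems.append(f"<{tag}>: expected at most {high}, found {count}")
    return problems


# Hypothetical, heavily simplified <head>: no namespace, no attributes.
SAMPLE = """
<head>
  <name>Example transcription</name>
  <rights-excluding-sources/>
  <declarations/>
  <agent/>
  <role/>
  <change/>
</head>
"""

print(check_head(ET.fromstring(SAMPLE)))
```

The check looks only at counts, not at the order of the children or at their contents.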

Rights and Licenses

Two TAN elements cover rights and licenses: <rights-excluding-sources> (mandatory in every TAN file) and <rights-source-only> (optional, and never allowed in class 2 files, because a statement on rights is required in each source). The first element covers the work specific to a given TAN file. The second pertains to the rights for the sources. The distinction is important, and helpful. It is much easier for you to decide and state the rights and license behind your own work than to do so for that of others. Declaring who holds what rights over your source(s) may be not only difficult but risky, and is therefore optional (see below).

As an editor, you are strongly encouraged in the <desc> element of <rights-excluding-sources> to emphasize the distinction between the rights you have over your data and the rights held by others over your source, for the benefit of those who may not be familiar with the TAN format. A statement something like this is recommended: <desc>The data in this file, only insofar as it constitutes an independent work, is licensed exclusive of any licenses held by parties over the source or sources listed below.</desc>

When using a TAN file, you should investigate the entire chain of rights. If you find a discrepancy between the two licenses—that of a TAN file and that of its sources—you should respect the more restrictive license. If a TAN file has a very liberal, open license for the data, this does not necessarily mean that the material upon which it depends is in the public domain. The TAN file's source may be under tight restrictions.

It is recommended that you not declare who owns what rights over your source unless you are quite certain. Copyright laws differ from one country to another, and they change. A source may be protected by copyright in one place and simultaneously be in the public domain in another. (At the time of this writing, dozens of scholarly editions of ancient texts are in the public domain in Germany, where copyright of a new edition lasts forty years, but not in the U.S. or Canada, where there is no explicit legislation on this issue.) Some copyright statements in books are false, or cannot be proven. Some persons or entities who claim rights over a source may have no legal basis for the claim, at least in some jurisdictions. Furthermore, if you mischaracterize the rights that are held over a source, you may be held liable by a putative rights holder. It is safer to use the <IRI> of <source> (described below) to point the user to a publisher or some other entity that has greater authority and specificity about who owns what rights.

TAN adopts the Creative Commons licenses as its default key vocabulary. See the section called “TAN keywords for types of rights (<rights-excluding-sources><rights-source-only>)”.

	Copyright Law versus Contract Law
Some third-party services, such as the Thesaurus Linguae Graecae for Greek texts, require users to agree not to copy and reuse the texts in the service's databases. Such agreements fall under the area of contract law and not copyright law. That is, many of these third parties have no intellectual property rights (or only derivative rights) over the texts they store. Therefore, they should normally not be credited in any `<rights-source-only>`.


Inclusions and Keys

Many if not most TAN files are created alongside or in the context of a project, where certain elements will be repeated. Such repetition makes the files prone to errors, where editorial corrections made in one place are mistakenly not made everywhere. TAN has two features that help avoid duplication, reduce the likelihood of incomplete editing, and lead to cleaner, smaller files.

Keys

Most often, an editor wants a simple, shorthand reference to an entity commonly referred to from one file to the next in a single project, e.g., the person who is the principal editor. Writing individual IRI + name patterns can be time-consuming, and if a change needs to be made, it is easy to be inconsistent or incomplete.

Vocabulary commonly used in a project may be kept in a <TAN-key> file. This file is made accessible to any other TAN file via <key>. The key vocabulary is then invoked by using @which, whose value should match a <name> value in the TAN-key file.
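The lookup described above can be sketched as follows. The layout of the key file shown here (`<item>` elements holding `<IRI>` and `<name>` children, no namespace) is an assumption made for illustration, as are the IRI, the names, and the helper functions `build_index` and `resolve_which`. Matching is done on the exact string, which may be stricter or looser than what TAN itself does.

```python
import xml.etree.ElementTree as ET

# Hypothetical TAN-key content; the element layout and the IRI are
# assumptions for this sketch, not taken from the TAN schemas.
KEY_FILE = """
<TAN-key>
  <body>
    <item>
      <IRI>tag:example.org,2015:agent:jdoe</IRI>
      <name>jdoe</name>
      <name>Jane Doe</name>
    </item>
  </body>
</TAN-key>
"""


def build_index(key_xml):
    """Map every <name> value in the key file to the IRIs of its item."""
    index = {}
    for item in ET.fromstring(key_xml).iter("item"):
        iris = [iri.text.strip() for iri in item.findall("IRI")]
        for name in item.findall("name"):
            index[name.text.strip()] = iris
    return index


def resolve_which(index, which):
    """Return the IRIs for a @which value, or fail if no <name> matches."""
    if which not in index:
        raise KeyError(f"no <name> in the key file matches which={which!r}")
    return index[which]


index = build_index(KEY_FILE)
print(resolve_which(index, "jdoe"))
```

Resolving every `@which` in this way before publication is one possible route to the workflow in which published files carry full values rather than keywords.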

A number of standard keys have already been predefined, documented in Chapter 9, Official TAN keywords. It is strongly recommended that you not depend upon the supplementary TAN-key files of a different project. Rather you should develop your own. You may also wish to create a workflow where the TAN-key is used for private editing, but the published versions have their keywords resolved to their full value.

Inclusions

More powerful than TAN-keys are inclusions. Unlike other forms of inclusion you may be familiar with, TAN inclusion involves only select elements, never an entire file.

As with keys, TAN inclusion is a two-step process. First, a TAN file is made available for inclusion by invoking <inclusion>s (inside <head>). Like <key>, an <inclusion> does nothing on its own. It merely indicates a file that may be used for patterned inclusions.

Inclusions are acted upon only in the second step. Many elements allow @include, which points to the @xml:id reference of an included file. In the validation process, those elements will be replaced with every element of that name found in the inclusion file, checked recursively (see below), and ignoring duplicated elements.

<inclusion>s are critically important to the content of the TAN file, so any file with <inclusion>s that cannot be located will be regarded as being in fatal error. Because of the importance of access to included files, it is strongly recommended that inclusions be limited to files locally available, in the same project.

Inclusions are recursive. If a TAN file A has <x include='B'> and file B has <x include='C D E'> then the validator for file A will replace the element with all <x>s found in B, C, D, and E.

In any recursive activity, circularity is fatal. That is true for TAN inclusion as well, but only within the domain of a given element name. It is perfectly legal for two files to include each other, as long as they do not try to include elements of the same name.
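The recursion and the circularity rule can be sketched with an in-memory model. Everything in the model is hypothetical: a file is a dictionary from element name to a list of entries, a literal element is a string, and an element carrying @include is a tuple of the ids it points to. The function `resolve` is an illustration of the rules stated above, not the TAN validator's actual procedure.

```python
from typing import Dict, List, Tuple, Union

# A literal element is modeled as a string; an element with @include is
# modeled as a tuple of target ids. Both are assumptions of this sketch.
Entry = Union[str, Tuple[str, ...]]
Project = Dict[str, Dict[str, List[Entry]]]


def resolve(project: Project, file_id: str, name: str, active=()) -> List[str]:
    """Collect every <name> element of file_id, following includes."""
    if file_id not in project:
        # An inclusion that cannot be located is a fatal error.
        raise LookupError(f"inclusion target {file_id!r} cannot be located")
    if (file_id, name) in active:
        # Circularity is judged per element name, not per file.
        raise ValueError(f"circular inclusion of <{name}> through {file_id!r}")
    active = active + ((file_id, name),)
    found: List[str] = []
    for entry in project[file_id].get(name, []):
        if isinstance(entry, tuple):
            candidates = [
                element
                for target in entry
                for element in resolve(project, target, name, active)
            ]
        else:
            candidates = [entry]
        for element in candidates:
            if element not in found:  # ignore duplicated elements
                found.append(element)
    return found


PROJECT: Project = {
    "A": {"x": [("B",)], "y": ["y-from-A"]},
    "B": {"x": ["x-from-B", ("C", "D", "E")], "y": [("A",)]},
    "C": {"x": ["x-from-C"]},
    "D": {"x": ["x-from-D", "x-from-C"]},
    "E": {"x": []},
}

print(resolve(PROJECT, "A", "x"))  # ['x-from-B', 'x-from-C', 'x-from-D']
print(resolve(PROJECT, "B", "y"))  # ['y-from-A']
```

Here A and B include each other, which is legal because A takes `<x>` from B while B takes `<y>` from A. If both files pointed at each other for `<x>`, `resolve` would raise an error.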

TAN inclusion removes elements from their original context, which means that values that must be interpreted locally are converted before the elements are included. For example, @which must be interpreted in light of the included document's keys, not those of the including document. Similarly, different numeration systems, e.g., Roman numerals, must be interpreted locally and converted, before inclusion (see the section called “One reference system”).

Distinguishing `<source>`s and `<see-also>`s

Creating and editing a class 1 TAN file frequently involves working with non-TAN digital files. In the course of editing, and making the material TAN-compatible, you will likely start to correct errors, to normalize conventions, or to bring the transcription closer to an earlier version. At such times it may be unclear how to credit the digital files.

To answer this, first determine a class 1 file's <source>s. Everything else is then a <see-also>.

If you find that you are changing the material to go back to the source of your source, then that earlier version should be the <source> and the file you were using should be credited under a <see-also>. But beware, lest using a particular source (such as the TLG) puts you in violation of contract law (see the section called “Rights and Licenses”).

Interpretation of inheritable attributes

Some attributes are inheritable attributes, in that they affect not only the host element but all descendants as well. Some inheritable attributes in co-occurrence fall into an interpretive sequence. That is, in any given element, some attributes must be interpreted before others.

@claimant falls first in the sequence, and @cert second. Each attribute qualifies the data governed by the element it modifies. Put another way, the two attributes are to be interpreted to mean: "@claimant has @cert confidence about the following data:...."

Suppose you are encoding claims made by someone else, and you are not certain whether you are faithfully representing their point of view. In those cases, your doubt should be registered in a @claimant and @cert that is a parent to the secondary claim you are representing.

If @claimant is missing, it is to be assumed that the assertion is being made by the key <agent> (see the section called “@id and a TAN file's IRI Name”).

If @cert is missing, it is to be assumed that the data is asserted with full confidence.
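Inheritance and the two defaults can be sketched as a walk from an element up through its ancestors. The element names `<group>` and `<claim>`, the claimant value, the function `inherited`, and the use of "1" to stand for full confidence are all assumptions made for illustration. The sketch handles only simple inheritance and does not model the layered case in which one claimant reports another's claim.

```python
import xml.etree.ElementTree as ET


def inherited(root, element, attribute, default):
    """Return the nearest value of attribute on element or its ancestors."""
    # ElementTree keeps no parent pointers, so build a child-to-parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    node = element
    while node is not None:
        if attribute in node.attrib:
            return node.attrib[attribute]
        node = parent_of.get(node)
    return default


# Hypothetical fragment: element names and values are placeholders.
DOC = ET.fromstring(
    "<body>"
    '<group claimant="smith" cert="0.5"><claim id="a"/></group>'
    '<claim id="b"/>'
    "</body>"
)
claim_a = DOC.find("group/claim")
claim_b = DOC.find("claim")

KEY_AGENT = "key-agent"  # placeholder for the file's key <agent>
FULL_CONFIDENCE = "1"    # assumed notation for full confidence

print(inherited(DOC, claim_a, "claimant", KEY_AGENT))   # smith
print(inherited(DOC, claim_a, "cert", FULL_CONFIDENCE))  # 0.5
print(inherited(DOC, claim_b, "claimant", KEY_AGENT))   # key-agent
print(inherited(DOC, claim_b, "cert", FULL_CONFIDENCE))  # 1
```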

Defining Words and Tokens

At the heart of interaction between class 1 and class 2 files is a reference system that counts or names words. This poses a problem at the outset. The term word is notoriously difficult to define, no matter the language. In different contexts, for example, "New York" and "didn't" can each be justifiably defined as one or two words. Furthermore, some scholars consider punctuation marks to be words (e.g., commas in modern prose, representing "and"), whereas others ignore them as being anachronistic or capricious (e.g., ancient Greek and Latin). In the end, the number of meanings for "word" reflects the rich variety of scholarly disciplines.

TAN adopts the proximate term token—a word that is defined not linguistically but computationally, according to a regular expression (see the section called “Regular Expressions”).

A TAN token is a reference pointer, not a linguistic marker. To define a token in TAN does not entail any linguistic commitments. Neither editors nor users of TAN data should infer that a <tok> points to a morpheme, a lexeme, or any other linguistic entity. There will frequently be a fortuitous correlation between the two, but it is not guaranteed. In TAN, a token is purely a method of reference.

TAN requires all class 2 files that handle tokens to define them, either implicitly through TAN defaults, or explicitly by using <token-definition>. TAN was developed in service of ancient literature, where punctuation is anomalous, or of little use. Furthermore, even in contemporary use, most people ignore punctuation when they count words. Therefore the default <token-definition> defines a token as being any continuous string of word characters, the soft hyphen, the zero-width space, or the zero-width joiner, formally defined:

<token-definition regex="[\w&#xad;&#x200b;&#x200d;]+"/>

This pattern will result in a close resemblance to what is ordinarily thought of as words, but perhaps with some surprises (see above, the section called “Regular Expressions”). If no <token-definition> is invoked for a particular source, the pattern above will be assumed. It may also be explicitly called through @which (see the section called “TAN keywords for types of token definitions (<token-definition>)”).
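The behavior of the default pattern, including one of the surprises, can be approximated in Python. This is only an approximation: Python's `\w` is not identical to `\w` in the regular expression dialect used by XML technologies, so results may differ for some characters. The names `DEFAULT_TOKEN` and `tokenize` are hypothetical.

```python
import re

# Python rendering of the default pattern: word characters plus
# U+00AD SOFT HYPHEN, U+200B ZERO WIDTH SPACE, U+200D ZERO WIDTH JOINER.
DEFAULT_TOKEN = re.compile("[\\w\u00ad\u200b\u200d]+")


def tokenize(text):
    """Return the tokens of text according to the default pattern."""
    return DEFAULT_TOKEN.findall(text)


# The apostrophe is not a word character, so "didn't" yields two tokens.
print(tokenize("New York didn't sleep."))
# ['New', 'York', 'didn', 't', 'sleep']

# A soft hyphen inside a word does not split it.
print(len(tokenize("co\u00adoperate")))  # 1
```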

If you are working with modern texts, where punctuation might be important to name and number, try the built-in keyword general (or letters and punctuation):

<token-definition regex="\w+|[^\w\s]"/>

This expression defines a token as a sequence of word characters or any single character that is neither a word character nor a space. The string "(I go!)" (the text inside the quotation marks) would have five tokens: ( I go ! ).
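The five-token result can be reproduced with the same pattern in Python. As before, this is a sketch: Python's `\w` and `\s` are close to, but not the same as, their counterparts in XML regular expressions, and the name `GENERAL_TOKEN` is hypothetical.

```python
import re

# The "general" pattern: a run of word characters, or any single character
# that is neither a word character nor whitespace.
GENERAL_TOKEN = re.compile(r"\w+|[^\w\s]")

tokens = GENERAL_TOKEN.findall("(I go!)")
print(tokens)       # ['(', 'I', 'go', '!', ')']
print(len(tokens))  # 5
```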

Above are the two built-in, TAN-defined <token-definition>s. You may customize your own <token-definition> to suit your needs. But keep in mind that TAN files were meant to be shared across fields and disciplines. You are encouraged to define tokens in a manner customary to users of the text. Specialized definitions make it less likely that your TAN file will be able to mesh well with other TAN files. Two class-2 files annotating the same class-1 file cannot be easily compared or synthesized if they use different definitions of token.

Given those caveats, consider a specialized case, where you wish to prepare your transcriptions such that certain Unicode characters precisely delimit tokens that are synonymous with a particular linguistic category, say lexeme. Say, for example, you use specialized control characters (e.g., U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER) to mark word boundaries within the text of your class 1 file. You might then create a <token-definition> like this:

<token-definition regex="[^\p{Cf}\s]+"/>

The statement defines a token as any consecutive sequence of characters that are neither whitespace nor format control characters (Unicode category Cf).
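Python's built-in `re` module does not support `\p{Cf}`, so the following sketch imitates the pattern with `unicodedata` instead of a regular expression. The sample string and the function names are hypothetical, and `str.isspace` covers more characters than `\s` does in XML regular expressions, so treat the result as an approximation.

```python
import unicodedata
from itertools import groupby


def is_delimiter(ch):
    """True for whitespace and for format characters (category Cf)."""
    return ch.isspace() or unicodedata.category(ch) == "Cf"


def tokenize(text):
    """Split text into maximal runs of non-delimiter characters."""
    return [
        "".join(run)
        for delimiter, run in groupby(text, key=is_delimiter)
        if not delimiter
    ]


# U+200C and U+200D mark token boundaries inside a visually unbroken word.
sample = "un\u200cbreak\u200dable word"
print(tokenize(sample))  # ['un', 'break', 'able', 'word']
```

Note that the default pattern keeps U+200D inside a token, whereas this definition treats it as a boundary, so the two definitions number the tokens of the same text differently.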

Such customized approaches may make the technique unwieldy or impossible to use, thereby limiting your TAN file's interoperability and utility. If you use format control characters or other invisible special characters, it is recommended that you write them as XML character references, e.g., &#x200d;, so that they can be seen in your file.
