The Principles of TAN Metadata (<head>)

At this point, we have finished four TAN files: two transcriptions, one TAN-A-div file, and one TAN-A-tok file. But we've suppressed the <head> in all of them, until now. Before getting into details, we need first to explain a few TAN principles.

Unlike <body>, which carries the raw data, <head> contains what is often called metadata; that is, <head> describes the raw data. Because the TAN format is intended primarily to serve scholars, and because the format is heavily regulated (that is, numerous validation rules supplement the basic ones behind XML), the metadata requirements are stricter than those for Word documents, HTML, TEI, or other formats you may know better. Scholars who find our file need to know some essential things before they can responsibly use it. For example, what sources have we used? Who produced the data, and when? What key assumptions were made in producing the data? What licenses govern it? These questions are not difficult to answer, but they are critical, and we should take the time to provide accurate answers.

Some metadata questions are specific to certain formats. For example, in a TAN-A-tok file we ask what relationship holds between the two sources, a question that makes no sense for a TAN-T file. Other questions, however, apply universally across all TAN files, no matter the kind of data. As we move from one TAN format to the next, we should deal as much as we can with similar structures and expectations. This reduces potential confusion in creating and editing a TAN file, and helps other people using our data find the information they want. More importantly, what we write in one file might save us some work in another.

The rigorous scholarly requirements for TAN metadata are offset somewhat by another principle adopted in the design of TAN, namely, that each format's <head> should focus exclusively on the data in <body> and nothing else. That is to say, in a transcription we should definitely indicate what our source is, but we should not try to write a catalog entry, or even a structured citation, for the book we have used. We are not library catalogers. Our obligation is merely to point to a place where a reader can get more complete information. The <head> is designed to help us stay focused on the task and data at hand.

TAN was also designed with the assumption that all metadata should be useful to both humans and computers. For our example above, we must describe the work we have chosen (Ring around the Rosie) in a way that is comprehensible not just to the reader but to the computer.

Take, for example, the 1881 book we used for our first transcription. For the human reader we can simply say something like "Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]". But computers need a more controlled, predictable syntax before they can be directed to the correct edition of Mother Goose (or rather to a digital surrogate of that edition). The human-readable string is too complex and syntactically opaque. A more computer-friendly identifier is the International Standard Book Number (ISBN), which distinguishes, for example, the 1984 version of Mother Goose illustrated by Kayoko Okumura from the one of the same year illustrated by William Joyce. The ISBNs for the Okumura version, 0671493159, and for Joyce's, 0394865340, can be converted into machine-actionable strings called Uniform Resource Names (URNs), in this case urn:isbn:0-671493159 and urn:isbn:0-394865340. (Our 1881 version was published before the ISBN program was introduced. We will see below another way to name it.)
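To make the conversion concrete, here is a minimal Python sketch, not part of TAN itself, that validates an ISBN-10 check digit and wraps the number in the urn:isbn: namespace. Both ISBNs mentioned above pass the check.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Validate an ISBN-10 check digit (weights 10 down to 1; 'X' = 10)."""
    digits = isbn.replace("-", "")
    if len(digits) != 10:
        return False
    total = sum(weight * (10 if ch in "Xx" else int(ch))
                for weight, ch in zip(range(10, 0, -1), digits))
    return total % 11 == 0

def isbn_to_urn(isbn: str) -> str:
    """Wrap a valid ISBN-10 in the urn:isbn: namespace."""
    if not isbn10_is_valid(isbn):
        raise ValueError(f"invalid ISBN-10: {isbn}")
    return "urn:isbn:" + isbn

print(isbn_to_urn("0-671493159"))  # urn:isbn:0-671493159 (Okumura)
print(isbn_to_urn("0-394865340"))  # urn:isbn:0-394865340 (Joyce)
```

The check digit makes the name self-verifying: a single mistyped digit is caught before the bad URN ever enters our metadata.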

URNs are families of formalized naming schemes regulated by a central body, the Internet Assigned Numbers Authority (IANA), to ensure permanent, persistent, unique names for various types of things. There are formal naming schemes for journals (ISSNs), articles (DOIs), and movies (ISANs), which means that anyone can use them to refer unambiguously to a particular item of each kind.

All URNs are simply names. They don't tell you where an object is. To provide a unique location, however, we have Uniform Resource Locators (URLs), which may be much more familiar from daily use of the Internet. Like URNs, URLs are also centrally regulated, with individuals or organizations buying the rights to domain names from a central registry (usually through a third-party vendor).

Both URNs and URLs can be thought of as the same type of thing, namely, a Uniform Resource Identifier (URI). A closely related concept is the Internationalized Resource Identifier (IRI), a generalization of the URI that allows characters from any Unicode script, not just the Latin alphabet. URIs/IRIs are, in essence, nothing more than the set of all URNs and URLs. These four acronyms are easily confused, and it is best to disambiguate them by thinking of the last letter in each: URIs/IRIs Incorporate both Locators (URLs) and Names (URNs).

If those acronyms are confusing, don't worry. For our purposes here they are pretty much the same, and from this point onward we'll use simply the term IRI (unless we really mean a location, which we'll call a URL).
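One practical sign that URLs and URNs share a single URI syntax is that both begin with a scheme, which any standard parser can extract. A small Python illustration (the http URL below is a hypothetical placeholder, not one of our catalog records):

```python
from urllib.parse import urlparse

# Both strings are URIs: each starts with a scheme followed by a colon.
for uri in ("http://example.com/mother-goose", "urn:isbn:0-394865340"):
    parts = urlparse(uri)
    print(parts.scheme)  # prints "http", then "urn"
```

The same parser handles both, which is exactly the point of grouping names and locators under one umbrella term.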

IRIs are essential to a system frequently called the semantic web or linked (open) data, an agreed way of writing and processing data that relies upon IRIs and a simple data model. The semantic web allows people to make assertions in a way that computers can "understand." If people working independently happen to use the same IRIs to describe the same things, then computers can be programmed to make associations between disparate, heterogeneous datasets. This allows us to find connections across disciplines and projects, to marshal computers to make inferences we might not make on our own, and to create a network of linked data.
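That simple data model can be sketched in a few lines of Python. The IRI and predicate names below are hypothetical placeholders; the point is only that two independently produced datasets join automatically when they use the same IRI for the same thing.

```python
# Linked data reduces assertions to (subject, predicate, object) triples,
# where subjects (and often predicates and objects) are IRIs.
RHYME = "http://example.com/id/ring-around-the-rosie"  # hypothetical IRI

dataset_a = [  # produced by one project
    (RHYME, "title", "Ring around the Rosie"),
    (RHYME, "genre", "nursery rhyme"),
]
dataset_b = [  # produced independently by another project
    (RHYME, "appears-in", "urn:isbn:0-394865340"),
]

# Because both datasets name the rhyme with the same IRI, a machine can
# merge everything known about it without human intervention.
about_rhyme = {(p, o) for s, p, o in dataset_a + dataset_b if s == RHYME}
print(len(about_rhyme))  # 3 facts about one subject
```

Real linked-data systems use RDF libraries and formal vocabularies rather than bare tuples, but the merging behavior is the same: shared IRIs are the join keys.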

TAN has been designed to be linked-data friendly, and so requires that almost all data in its <head> be representable not just in a human-readable form but also in a computer-readable one, as an IRI.

Our first task, then, in writing the <head> sections of our four TAN files is to look for IRI vocabulary that will be familiar to the people most likely to use our files. In trying to find suitable IRIs, we will find that the persons, things, and concepts we want to describe will range from the highly familiar to the unfamiliar.

Highly familiar: The two books that provide the basis of our transcriptions are well catalogued and generally known. A number of library services offer a controlled IRI vocabulary that anyone can use to describe uniquely a particular version of a book; WorldCat (run by OCLC) and the Library of Congress are good examples. In our case, we have found accurate Library of Congress IRIs for both editions of Mother Goose. Observe that these two IRIs are also, perhaps confusingly, URLs (locations). If we paste these strings into a browser, we retrieve a record that describes the book. The locator does not lead us to the book itself, only to information about the book. Nevertheless, the Library of Congress has decided to make each URL also a name for the book. Anyone who owns a domain name can designate a URL as a name for an object, and can then set up a server to return information about the object the IRI names. This subtle ambiguity (the URL both names an entity and locates a webpage) can confuse those who are new to the semantic web, because such URLs name, in reality, two types of things: an entity and a location where more information about that entity can be found.

We now have IRIs for the sources. Let's now find an IRI to name the work, Ring around the Rosie. The work is widely known and even has a Wikipedia entry. That entry is a benefit: the Universities of Leipzig and Mannheim and OpenLink Software have collaborated on a project called DBpedia, which is committed to providing a unique IRI for every Wikipedia entry in the major languages. The DBpedia IRI in this case is, once again, both a name and a locator. It names a specific intangible object, namely the nursery rhyme we have called Ring around the Rosie, no matter the specific version. But if you put that name into your browser, you will get back more information about the named object.

Familiar, but only in small circles: We will need names for some of the people who edited the file. Here we are not interested in the authors of our books; we are interested in crediting the people who helped make the TAN file. Most people who write and edit our TAN file will not be well-known public figures. If they are, and if they are famous enough to have a Wikipedia entry, a DBpedia IRI can be used. Or if some of the contributors are also published authors, there is a good chance they are listed in the databases of either VIAF or ISNI, both of which publish unique IRIs for persons.

Many contributors to TAN files, however, will not be listed in these general databases. In those cases, we can name these participants with an IRI that we "own." We have already done something like this by assigning tag URNs to our four transcriptions (the value of @id in the root element). Our editors can do the same thing. If a student named Robin Smith has been helping with proofreading, Robin can take an email address (even one that no longer works) and a date when that address was in use, and construct a tag URN such as tag:robin.smith@example.com,2012:self (the address here is illustrative). This has the slight drawback that we cannot type the string into a browser to find out more about Robin, but it at least allows us to assign a name that will not be confused with any other Robin Smith, such as one identified by ISNI. (If we want to go a step further, Robin could mint a URN from a domain name that she owns and set up a linked-data service that offers more information, both human- and computer-readable. But this is not required, and it can be a lot of work to maintain.)
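The recipe Robin follows is defined by RFC 4151: the literal prefix tag:, an authority (an email address or domain name), a comma, a date on which the minter controlled that authority, a colon, and a specific string. A minimal sketch, with a hypothetical email address:

```python
def make_tag_urn(authority: str, date: str, specific: str) -> str:
    """Assemble a tag URN (RFC 4151): tag:<authority>,<date>:<specific>."""
    return f"tag:{authority},{date}:{specific}"

# Hypothetical example: an email address Robin used in 2012.
print(make_tag_urn("robin.smith@example.com", "2012", "self"))
# tag:robin.smith@example.com,2012:self
```

Because only one person controlled that email address on that date, no one else can legitimately mint the same name, which is what makes tag URNs reliably unique without any central registration beyond the domain system itself.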

Now we come to a more difficult challenge. We have to assign an IRI to the relationship that we claim holds between two text-bearing objects. Making that relationship clear is important, because if we had a different view of how one text related to the other, it would probably affect the specifics of our word-for-word alignments.

We are assuming for the sake of illustration that the version published in the 1987 Mother Goose is a direct descendant of the 1881 version. Because no suitable IRI vocabulary yet exists for such concepts, TAN has coined an IRI that can be used by anyone wishing to declare that the second of two sources descends from the first through an unknown number of intermediaries:,2015:bitext-relation:a/x+/b.

We face a similar issue when thinking about text reuse. We generally consider the 1987 version to be an adaptation of the 1881 version, and there are no stable, well-published IRI vocabularies for text reuse. So we adopt a TAN-coined IRI,,2015:reuse-type:adaptation:general.

In both cases above, we could have come up with our own vocabulary. But the idea here is that we should be sharing a common vocabulary whenever possible. The built-in TAN vocabulary simply gives us a convenient lingua franca for describing some important but abstract concepts. For other examples of IRIs coined by TAN, see Chapter 9, Official TAN keywords.

Generally unfamiliar: Some things or concepts will be known to very few people, perhaps not even to us. If we plan to refer to such a thing or concept often, it is preferable to coin a tag URN, as described above. But in some cases we might find that a tag URN we minted for some concept or thing was, in hindsight, misleading or poorly constructed, because we hadn't thought as thoroughly as we could have about the category. To avoid these kinds of situations, we can assign a randomly generated IRI called a universally unique identifier (UUID), e.g., urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0. UUID URNs are very useful. The likelihood that a randomly generated UUID will be identical to any other UUID is astronomically small, making them reliably unique names for anything (barring someone copying and reusing the same UUID URN to name some other object or concept). Numerous free UUID generators can be found online.
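An online generator is not strictly necessary; most programming languages can mint UUIDs directly. In Python, for instance, the standard library even renders the URN form for us:

```python
import uuid

# uuid4() draws 122 random bits; the .urn property prefixes the
# canonical hyphenated form with "urn:uuid:".
name = uuid.uuid4().urn
print(name)  # e.g. urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0
```

Each run produces a fresh, effectively unrepeatable name, ready to paste into a <head>.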

To humans, a UUID on its own is meaningless, and rather ugly. But it is a start. We always have the option, later, of adding another IRI; it is perfectly fine to give one object or concept multiple IRIs. But the reverse is never true: one should never use the same IRI to identify more than one object or concept.