The Principles of TAN Metadata (<head>)

At this point, we have finished four TAN files: two transcriptions, one TAN-A-div file, and one TAN-A-tok file. Until now, however, we have suppressed the <head> in all of them. Before getting into details, we first need to discuss a few principles that TAN relies upon.

Unlike <body>, which carries the raw data, <head> contains what is oftentimes called metadata. That is, <head> contains data that describes the data. Because the TAN format is intended primarily to serve scholars, and because the format is heavily regulated (that is, there are numerous validation rules that supplement the basic ones behind XML), the metadata requirements are stricter than those of other formats. Scholars who use our data really need to know some essential things before they can responsibly use the data we produce. For example, what are the sources we have used? Who produced the data? When? What key assumptions have been made in producing the data? What rights do other people have to use the data? The questions are not difficult to answer, but they are critical, and we should take the time we need to get correct answers.

Some of these questions are specific to certain types of data. For example, in a TAN-A-tok file, we ask what relationship the two sources hold to each other, a question that makes no sense for a TAN-T file. Other questions apply universally across all TAN files, no matter what kind of data. As we go from one TAN format to the next, we need to deal as much as we can with similar structures and expectations. This reduces potential confusion in creating and editing a TAN file, and helps other people using our data to find the information they want. More importantly, what we write in one file might save us some work in another.

The rigorous scholarly requirements for TAN metadata are offset somewhat by another principle that was adopted in the design of TAN, namely, that each format's <head> should focus exclusively upon the data in <body> and not other things. That is to say, in a transcription, we should definitely indicate what our source is. But we should not try to write a catalog entry, or even a structured citation, for the book we have used. We are not library catalogers. Our obligation is merely to point somewhere a reader can get more complete information. The <head> is designed to help us to stay focused on the task and data at hand.

TAN was also designed with the assumption that all metadata should be useful to both humans and computers. For our example above, we must describe the work we have chosen in such a way that the phrase Ring around the Rosie is comprehensible not just to the reader but to the computer, using syntax that a computer can be programmed to act upon.

Take for example the 1881 book we have used for our first transcription. For the human reader we can say simply something like "Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]". But computers need a more controlled, predictable syntax before they can be directed to the correct edition of Mother Goose (or rather to a digital surrogate of the edition). The human-readable string is too complex and syntactically opaque. A more computer-friendly identifier would be an International Standard Book Number (ISBN), which distinguishes the 1984 version of Mother Goose illustrated by Kayoko Okumura from the one of the same year illustrated by William Joyce. The ISBNs for the Okumura version, 0671493159, and for Joyce's, 0394865340, can be converted into machine-actionable strings called uniform resource names (URNs), in this case urn:isbn:0-671493159 and urn:isbn:0-394865340. (Our 1881 version was published before the ISBN program was introduced. We will see below other ways to name it.)
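The conversion from an ISBN to a URN can be sketched in a few lines of Python. This is only an illustration, not part of TAN: the function names are hypothetical, the sketch handles ISBN-10 only, and it normalizes by dropping hyphens, which is an assumption about presentation (the hyphenated forms quoted above name the same books).

```python
def isbn10_is_valid(isbn):
    """Check an ISBN-10 with its weighted modulo-11 checksum."""
    digits = isbn.replace("-", "").upper()
    if len(digits) != 10:
        return False
    total = 0
    for position, char in enumerate(digits):
        if char == "X" and position == 9:
            value = 10  # 'X' stands for 10 and is allowed only as the check digit
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        # Weights run from 10 (first digit) down to 1 (check digit).
        total += (10 - position) * value
    return total % 11 == 0


def isbn_to_urn(isbn):
    """Return a urn:isbn: name for a valid ISBN-10 (hyphens removed by assumption)."""
    if not isbn10_is_valid(isbn):
        raise ValueError("not a valid ISBN-10: %r" % isbn)
    return "urn:isbn:" + isbn.replace("-", "").upper()


print(isbn_to_urn("0671493159"))  # urn:isbn:0671493159
print(isbn_to_urn("0394865340"))  # urn:isbn:0394865340
```

Validating the check digit before minting the URN guards against a mistyped ISBN silently naming a different book, or no book at all.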

URNs are families of formalized naming schemes regulated by a central body (Internet Assigned Numbers Authority, IANA) to ensure that people and organizations can legitimately coin and use permanent, persistent, unique names for various types of things. There are URN schemes for journals (via ISSNs), articles (DOIs), and movies (ISANs), which means that anyone can refer to them unambiguously in a manner that is computer-friendly.

All URNs are simply names. They don't tell you where an object is, just what its name is. To provide a unique location, however, we have uniform resource locators (URLs), which might be much more familiar from daily use of the Internet, e.g., http://academia.edu. Like URNs, URLs are also centrally regulated, with individuals or organizations buying the rights to domain names from a central registry (usually through a third-party vendor).

Both URNs and URLs can be thought of as the same type of thing, namely, a uniform resource identifier (URI). A closely related term is the internationalized resource identifier (IRI), which extends the URI to allow characters from any script in Unicode, not just the Latin letters, digits, and punctuation of ASCII. URIs/IRIs are, in essence, nothing more than the set of all URNs and URLs. These four acronyms can be easily confused, and it is best to disambiguate them by thinking of the last letter in each. URIs/IRIs Incorporate both Locators (URL) and Names (URN).
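The distinction between names and locators is visible in the scheme, the part of the identifier before the first colon. The following Python sketch, using only the standard library, sorts identifiers on that basis. The function name and the list of schemes treated as locators are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def classify_identifier(identifier):
    """Roughly classify an identifier as a name or a locator by its scheme."""
    scheme = urlsplit(identifier).scheme.lower()
    if scheme == "urn":
        return "name (URN)"
    if scheme in {"http", "https", "ftp"}:  # assumed list of locator schemes
        return "locator (URL)"
    if scheme:
        return "other scheme: " + scheme
    return "no scheme found"


print(classify_identifier("urn:isbn:0671493159"))  # name (URN)
print(classify_identifier("http://academia.edu"))  # locator (URL)
```

A check like this says only what kind of identifier a string claims to be. It does not tell us whether the name has been registered or whether the location still resolves.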

IRIs are essential to a system frequently called the semantic web or linked (open) data, an agreed way of writing and processing data that relies upon IRIs and a simple data model to connect them. The semantic web allows independent parties to make assertions about things, and if they happen to use the same IRI vocabulary to describe those things, then we can program computers to make associations between disparate, heterogeneous datasets. This allows us to find connections across disciplines and projects, to marshal computers to make inferences we might not make on our own, and to create a network of linked data.

TAN has been designed to be linked-data friendly, and so requires that almost all data in its <head> be expressed not just in human-readable form but also in computer-readable form, as an IRI.

Our first task, then, in writing the <head> sections of our four TAN files is to look for IRI vocabulary that will be familiar to the community of practice most likely to use our files. In trying to find suitable IRIs, we will find that the persons, things, and concepts we want to describe will range from the highly familiar to the unfamiliar.

Highly familiar: The two books that provide the basis of our transcription are well catalogued and generally known. A number of services provided by librarians provide a controlled IRI vocabulary that can be used by anyone to describe uniquely a particular version of a book. WorldCat (run by OCLC) and the Library of Congress are good examples. In our case, we have found accurate Library of Congress IRIs for both editions of Mother Goose: http://lccn.loc.gov/12032709 and http://lccn.loc.gov/87042504. Observe that these two IRIs are also, perhaps confusingly, URLs. If we paste these strings into our browser, we retrieve a record that describes the book. This locator does not lead us to the book per se, only to information about the book. Nevertheless, the Library of Congress has decided to coin this URL also as an IRI name for the book. Anyone who owns a domain name can designate a URL as a name for an object. And that allows them to set up their server to also return information about the object the IRI names. This subtle ambiguity—that the URL both names an entity and is a location for a webpage—can sometimes be confusing to those who are new to the semantic web, because such URLs name in reality two types of things: an entity and a location to find out more information about that entity.

We now have IRIs for the sources. Let's now find an IRI to name the work, Ring around the Rosie. The work is widely known, and even has a Wikipedia entry. That Wikipedia entry is fortuitous. The Universities of Leipzig and Mannheim and Openlink Software have collaborated on a project called DBPedia, which is committed to providing a unique IRI for every Wikipedia entry in the major languages. The DBPedia IRI for the work we have chosen is http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses. Once again, this is both a name and a locator. It names a specific intangible object, namely a nursery rhyme that we've called Ring around the Rosie, no matter what specific version. But if you put that name into your browser, you will get back more information about that named object.
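With IRIs in hand for both sources and for the work, we can illustrate why shared IRI vocabulary matters. The Python sketch below models two independent datasets as sets of subject-predicate-object statements and merges them. It is a toy illustration, not how TAN or any linked-data library stores statements, and the predicate IRI is a hypothetical one invented for this example.

```python
# Hypothetical predicate IRI, invented for this illustration only.
CONTAINS_VERSION_OF = "http://example.org/vocab/containsVersionOf"

# IRI for the work, as given in the text.
RHYME = "http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses"

# Two datasets, imagined as made by unrelated projects. Each holds
# (subject, predicate, object) statements about one of our two sources.
project_a = {("http://lccn.loc.gov/12032709", CONTAINS_VERSION_OF, RHYME)}
project_b = {("http://lccn.loc.gov/87042504", CONTAINS_VERSION_OF, RHYME)}


def subjects_linked_to(iri, *datasets):
    """Merge the datasets and list every subject that points at the IRI."""
    merged = set().union(*datasets)
    return sorted(subject for subject, _, obj in merged if obj == iri)


# Because both projects used the same IRI for the work, the merge
# connects the two books without any manual reconciliation.
print(subjects_linked_to(RHYME, project_a, project_b))
```

Had either project described the work only with a free-text title, the computer would have had no reliable way to see that both statements concern the same rhyme.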

Familiar, but only in small circles: We will need to have names for some of the people who edited the file. Here we're not interested in the authors of our books. We are interested in crediting the people who helped make the TAN file. Most people who contribute to the creation of the data file will not be well-known, public figures. If they are, and if they are famous enough to have a Wikipedia entry, then a DBPedia IRI could be used. Or if some of the contributors are also published authors, there is a good chance that they are listed in the databases of either VIAF or ISNI, both of which publish unique IRIs for persons.

Many contributors to TAN files, however, will not be listed in these general databases. In these cases, we can assign our own IRI to name these participants. We have already done something like this by assigning tag URNs to our four TAN files (the value of @id in the root element). We can do the same for our editors. If a student, Robin Smith, has been helping with proofreading, we can take an email address for Robin (even one that doesn't work any more) and a date when the email address was in use and construct a tag URN such as tag:smith.robin@example.com,2012:self. This has a slight drawback in that we cannot type this string into our browser to find out more about Robin, but it at least allows us to assign a name that will not be confused with the Robin Smith identified by ISNI as http://isni.org/isni/0000000043306406. (If we want to go a step further, we could mint an IRI from a domain name that we own, and set up a linked data service that offers more information, human- and computer-readable, about Robin, but this is not required, and it can be a lot of work to maintain.)
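The recipe for a tag URN (an authority we controlled, a date on which we controlled it, and a specific name of our choosing) can be sketched as a small Python helper. The function name and the checks are assumptions made for illustration. The checks are deliberately light and do not implement every rule of the tag scheme.

```python
import re


def make_tag_name(authority, date, specific):
    """Build a tag name from an authority, a date, and a specific part.

    authority: an email address or domain name that we controlled
    date:      YYYY, YYYY-MM, or YYYY-MM-DD, a time when we controlled it
    specific:  whatever label we choose for the thing being named
    """
    if not authority or any(char.isspace() or char == "," for char in authority):
        raise ValueError("authority must be a non-empty email address or domain name")
    if not re.fullmatch(r"\d{4}(-\d{2}(-\d{2})?)?", date):
        raise ValueError("date must look like YYYY, YYYY-MM, or YYYY-MM-DD")
    return "tag:%s,%s:%s" % (authority, date, specific)


print(make_tag_name("smith.robin@example.com", "2012", "self"))
# tag:smith.robin@example.com,2012:self
```

The date is what keeps the name stable: even if the email address later passes to someone else, the combination of address and date still points back to the person who held it at that time.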

Another example of field-specific IRIs is the concept of relationship between two text-bearing objects. We are assuming for the sake of illustration that the version published in the 1987 Mother Goose is a direct descendant of the 1881 version. Our assumption is important to declare, because if we had a different view on how one related to the other, it would probably affect the specifics of our word-for-word alignments. Because no suitable IRI vocabulary yet exists for such concepts, TAN has coined an IRI that can be used by anyone wishing to declare that the second of two sources descends from the first through an unknown number of intermediaries: tag:textalign.net,2015:bitext-relation:a/x+/b.

We face a similar issue when thinking about text reuse. We generally consider the 1987 version to be an adaptation of the 1881 version, but there are no stable, well-published IRI vocabularies for text reuse. So we adopt a TAN-coined IRI, tag:textalign.net,2015:reuse-type:adaptation:general.

For other examples of IRIs coined by TAN, see Chapter 9, Official TAN keywords.

Generally unfamiliar: Some things or concepts will be known to very few people, perhaps only to us. If we plan to refer to that thing or concept often, it is preferable to coin a tag URN, as described above. But in some cases, we might find that a tag URN we minted for some concept or thing was, in hindsight, misleading or poorly constructed, because we hadn't taken into account other things that should be named. So if we wish to avoid these kinds of situations, we can assign a random IRI called a universally unique identifier (UUID), e.g., urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0. These uuid URNs, which are generated by computers through randomizing functions, are very useful. The likelihood that a randomly generated uuid will be identical to any other uuid is astronomically small, making them reliably unique names for anything (barring someone copying and reusing that uuid URN to name some other object or concept). Numerous free UUID generators can be found online.
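An online generator is not strictly necessary. As one possible approach, Python's standard uuid module can produce a random UUID and format it as a URN. The variable name below is arbitrary, and each run prints a different value.

```python
import uuid

# uuid4() draws a random (version 4) UUID; .urn formats it as urn:uuid:...
new_name = uuid.uuid4().urn
print(new_name)
```

Once generated, the value should be recorded and reused. Generating a fresh UUID every time we refer to the same concept would defeat the purpose of naming it.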

To humans, a UUID on its own is meaningless, and rather ugly. But it is a good start. We always have the option, later, of adding another IRI. It's perfectly fine to give one object or concept multiple IRIs. But the reverse is never true. One should never use the same IRI to identify more than one object or concept.
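This one-way rule is easy to check mechanically. The Python sketch below assumes a hypothetical registry that maps each thing we describe to its list of IRIs, and reports any IRI that has been given to more than one thing. The registry layout, the labels, and the example.org IRI are invented for illustration.

```python
from collections import defaultdict


def find_reused_iris(registry):
    """Return each IRI assigned to more than one entity, with those entities."""
    owners = defaultdict(set)
    for entity, iris in registry.items():
        for iri in iris:
            owners[iri].add(entity)
    return {iri: sorted(entities)
            for iri, entities in owners.items()
            if len(entities) > 1}


# Acceptable: one person with two IRIs (the second is hypothetical).
acceptable = {
    "Robin Smith, proofreader": [
        "tag:smith.robin@example.com,2012:self",
        "http://example.org/people/robin-smith",
    ],
    "Robin Smith, ISNI record": ["http://isni.org/isni/0000000043306406"],
}

# Not acceptable: the same IRI used for two different people.
conflicting = {
    "Robin Smith, proofreader": ["tag:smith.robin@example.com,2012:self"],
    "Robin Smith, ISNI record": ["tag:smith.robin@example.com,2012:self"],
}

print(find_reused_iris(acceptable))   # {}
print(find_reused_iris(conflicting))
```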