<head>

At this point, we have finished four TAN files: two transcriptions (TAN-T), one macro-alignment file (TAN-A), and one micro-alignment file (TAN-A-tok). We have avoided discussing the <head> in each of them until now. Before getting into details, some important concepts need to be covered first.
Unlike <body>, which carries the raw data, <head> contains what is often called metadata. That is, <head> contains data about the data that is in <body>.
Because the TAN format is intended primarily to serve scholars, and because the
format is heavily regulated (that is, there are numerous validation rules that
supplement the standard XML ones), the metadata requirements are stricter than they
are for Word documents, HTML, TEI, or other formats you might know better. Scholars who find our file will want to know certain things about it before they can responsibly use it. For example: What sources have we used? Who produced the data? When? What changes or adjustments have been made? What licenses govern the use of the data? These questions are not difficult, but answering them requires thought, care, and some time.
Some metadata questions apply only to one TAN format. For example, in a TAN-A-tok
file, we ask what relationship holds between the two sources. But that question makes
no sense for a TAN-T file, which is merely a transcription. Other questions apply universally, to every TAN file, no matter what kind of data it carries. The TAN formats have been designed so that <head> handles common metadata consistently across every format. This reduces potential confusion and helps other people using our data find the information they want. More important, what we write in one file can be referenced by another without duplication, which reduces the chance of errors.
Another TAN principle is that each <head> should focus exclusively upon the scope of the data in <body>, and not on other things. For example, in a TAN-T file we are concerned only with the transcription, so our metadata too should be concerned only with the transcription. We should indicate its source, but because our file is not about the source itself, we don't need to describe it further. We are not library catalogers, nor should we be. A TAN-T file is for transcribing, not for curating bibliographical data. Our obligation is merely to point a reader to complete and authoritative information, found elsewhere.
TAN was also designed under the principle that all metadata should be useful to both humans and computers. For our example above, we must describe the work we have chosen (Ring around the Rosie) in a way that is comprehensible not just to the reader but to the computer.
Take for example the 1881 book we have used for our first transcription. For the human reader we can write something like "Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]". But this human-readable string is too complex and syntactically opaque for computers and algorithms. A more computer-friendly identifier is the International Standard Book Number (ISBN), which distinguishes, for example, the 1984 version of Mother Goose illustrated by Kayoko Okumura from the one of the same year illustrated by William Joyce. The ISBNs for the Okumura version, 0671493159, and for Joyce's, 0394865340, can be converted into machine-actionable strings called uniform resource names (URNs), in this case urn:isbn:0-671493159 and urn:isbn:0-394865340. (Our 1881 version was published before the ISBN program was introduced. We will see below another way to name it.)
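To make the conversion concrete, here is a minimal Python sketch that checks an ISBN-10's weighted check digit and wraps an ISBN in the urn:isbn namespace (RFC 3187). The function names are our own, chosen for illustration:

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Check an ISBN-10's check digit: the weighted digit sum must be 0 mod 11.

    The final character may be 'X', which stands for the value 10."""
    digits = isbn.replace("-", "")
    if len(digits) != 10:
        return False
    total = 0
    for position, char in enumerate(digits):
        value = 10 if char in "Xx" else int(char)
        total += value * (10 - position)  # weights run 10, 9, ..., 1
    return total % 11 == 0


def isbn_to_urn(isbn: str) -> str:
    """Wrap an ISBN in the 'isbn' URN namespace (RFC 3187)."""
    return "urn:isbn:" + isbn


# The two ISBNs discussed above:
print(isbn_to_urn("0-671493159"))     # urn:isbn:0-671493159
print(isbn10_is_valid("0671493159"))  # True (Okumura edition)
print(isbn10_is_valid("0394865340"))  # True (Joyce edition)
```

Running the check on both ISBNs confirms that they are well formed before we commit them to our metadata.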
There are different URNs for different things: journals (via ISSNs, urn:issn:...), articles (DOIs, urn:doi:...), movies (ISANs, urn:isan:...), and so forth, which means that anyone can use them to refer unambiguously to a particular kind of thing. URN naming schemes must be registered with the Internet Assigned Numbers Authority (IANA), which ensures permanent, persistent, unique names for various types of things. (See IANA's registry and the section called “$official-urn-namespaces” for a complete list of official URN schemes.)
All URNs are simply names. They don't tell you where an object is. To provide a unique location, we have the perhaps more familiar uniform resource locators (URLs), e.g., http://academia.edu. Like URNs, URLs are also centrally regulated, with individuals or organizations buying the rights to domain names from a central registry (usually through a third-party vendor).
Both URNs and URLs can be thought of as the same type of thing, namely, a uniform resource identifier (URI), sometimes called an internationalized resource identifier (IRI). An IRI is a type of URI that allows any alphabet in Unicode, not just Latin. URIs/IRIs are, in essence, nothing more than the set of all URNs and URLs. These four acronyms are easily confused and conflated, even by veterans. URIs and IRIs are basically the same thing, and they encompass URNs and URLs, a relationship that can be remembered by the last letter in each acronym: URIs/IRIs Incorporate both Locators (URLs) and Names (URNs).
If those acronyms are confusing, don't worry. For our purposes, they are pretty much all the same, and from this point onward we'll stick with the term IRI (unless we really mean a location to find a file, which we'll call a URL).
IRIs are essential to a system frequently called the semantic web or linked (open) data, which relies upon IRIs as the basis for a simple universal data model. The semantic web allows people to make assertions in a way that computers can "understand." If people, working independently, happen to use the same IRIs to describe the same things, then computers can be programmed to make associations between disparate, heterogeneous datasets. For example, if one scholar claims through IRIs that X is the mother of Y, and another claims in a different dataset that Y is the mother of Z, a computer can infer that X is the grandmother of Z, without the two scholars being aware of each other's work. When many scholars begin to use IRIs in their data, the result is a network that allows us or anyone else to discover connections across disciplines and projects, and make inferences that transcend any single project.
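The grandmother inference can be sketched in a few lines of Python. The triples and IRIs below are hypothetical, and a real linked-data system would use a standard model such as RDF, but the principle is the same: assertions made independently connect wherever they share an IRI.

```python
# Each assertion is a (subject, predicate, object) triple; the IRIs are
# made-up example.com names, used here purely for illustration.
triples = {
    ("http://example.com/person/X", "motherOf", "http://example.com/person/Y"),
    ("http://example.com/person/Y", "motherOf", "http://example.com/person/Z"),
}


def infer_grandmothers(facts):
    """If A is the mother of B and B is the mother of C, infer A grandmotherOf C."""
    inferred = set()
    for a, p1, b in facts:
        for b2, p2, c in facts:
            if p1 == p2 == "motherOf" and b == b2:
                inferred.add((a, "grandmotherOf", c))
    return inferred


print(infer_grandmothers(triples))
# {('http://example.com/person/X', 'grandmotherOf', 'http://example.com/person/Z')}
```

The two motherOf assertions could come from different datasets; the join succeeds only because both use the same IRI for Y.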
TAN has been designed to be semantic-web friendly, and so requires almost all data in its <head> to be not just human-readable but also computer-readable, normally as an IRI.
Our first task, then, in writing the <head> sections of our four TAN files is to look for IRI vocabulary that will be familiar to those most likely to use our files. In trying to find suitable IRIs, we will find that the persons, things, and concepts we want to describe range from the highly familiar to the unfamiliar.
Highly familiar: The two books that provide the basis of our transcriptions are catalogued and generally well known. A number of services maintained by librarians publish controlled IRI vocabularies that anyone can use to unambiguously identify a particular version of a book. WorldCat (run by OCLC) and the Library of Congress are good examples.
In our case, we have found Library of Congress IRIs for both editions of Mother Goose: http://lccn.loc.gov/12032709 and http://lccn.loc.gov/87042504. Observe that these two IRIs are also, perhaps confusingly, URLs (locations). If we paste these strings into our Web browser, we retrieve a record that describes the book. This locator does not lead us to the book itself, only to information about the book. Nevertheless, the Library of Congress has decided to make this URL also a name for the book, which means that it does double duty, both as a location for a Web page and as a name for a book. Anyone who owns a domain name can designate a URL as a name for an object, a practice that can easily confuse anyone new to the semantic web, because such URLs in reality name two types of things: an entity and a web resource where one can learn more about that entity. The idea is that hundreds of years from now, when the web page no longer exists, the name will still be valid.
In the TAN system, you can apply as many IRIs to a concept as you like. In fact, it is a good practice to find and add as many IRIs as you think worthwhile, just in case someone can't figure out what you're trying to identify. Just make sure that any IRI you copy unambiguously points to the thing you have in mind.
We now have IRIs for the sources. Let's now find an IRI for the work, Ring around the Rosie. The work is widely known, and even has a Wikipedia entry. That Wikipedia entry is a benefit: the Universities of Leipzig and Mannheim and OpenLink Software have collaborated on a project called DBpedia, which provides a unique IRI for every Wikipedia entry in the major languages. The DBpedia IRI in this case is http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses. Once again, this is both a name and a locator. It names a specific, intangible, abstract work, namely, the nursery rhyme that we've called Ring around the Rosie, no matter what specific version. But if you put that IRI into your browser, you will get back more information about that named object.
Familiar to specialists: We will need to have IRIs for some of the people who edited the file. Here we're not interested in the authors of the books we transcribed. We are interested in identifying the people who helped make the TAN file itself. Most people who write and edit TAN files will not be well-known, public figures. If they are, and if they are famous enough to have a Wikipedia entry, then a DBPedia IRI could be used. Or if some of the contributors are also published authors, there is a good chance that they are listed in the databases of either VIAF or ISNI, both of which publish unique IRIs for authors, editors, and other persons central to the publications held in the world's libraries.
Most contributors to TAN files, however, will not be listed in these databases. In those cases, we can name these participants with an IRI that we "own." We have already done something like this by assigning tag URNs to our four TAN files (the value of @id in the root element).
Our editors can do the same thing. If a student Robin Smith has been helping with proofreading, Robin can take an email address (even one that doesn't work any more) and a date when the email address was in use, and construct a tag URN such as tag:smith.robin@example.com,2012:self. This has a slight drawback, in that we cannot type this string into our browser to find out more about this particular Robin, but it at least allows us to assign a name that will not be confused with another Robin Smith, for example the one identified by ISNI as http://isni.org/isni/0000000043306406. (If we want to go a step further, Robin could mint a URN from a domain name that she owns, and set up a linked data service that offers more information, both human- and computer-readable. But this is not required, and it can be a hassle to set up and maintain.)
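A tag URN has a simple, fixed shape defined by RFC 4151: an authority (a domain name or email address), a date on which the minter controlled that authority, and a free-form specific part. The helper below is a hypothetical sketch of that recipe, not part of any TAN tooling:

```python
import re


def make_tag_urn(authority: str, date: str, specific: str) -> str:
    """Build a tag URN (RFC 4151): tag:<authority>,<date>:<specific>.

    The authority should be a domain name or email address that the
    minter controlled on the given date; the date may be YYYY,
    YYYY-MM, or YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}(-\d{2}){0,2}", date):
        raise ValueError("date must be YYYY, YYYY-MM, or YYYY-MM-DD")
    return f"tag:{authority},{date}:{specific}"


# Robin's tag URN from the example above:
print(make_tag_urn("smith.robin@example.com", "2012", "self"))
# tag:smith.robin@example.com,2012:self
```

Because the authority-plus-date pair belonged to exactly one person at that time, every URN minted this way is unique without any central registry beyond the domain-name system itself.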
Let's take up a more difficult challenge in locating an IRI: describing the @bitext-relation in our TAN-A-tok file. @bitext-relation draws from the discipline of stemmatology, which studies how manuscripts were copied from each other, and tries to place those manuscripts in a chain of transmission, a kind of historical stemma (tree). We have to find an IRI that describes the relationship that we claim holds between two text-bearing objects. Making that claim clear is important, because our perspective on the relationship between the two books affects the decisions we make when we align words, and other scholars using our files will want to know the assumptions we held when we aligned the two texts.
For the sake of illustration, we posit that the version published in the 1987 Mother Goose is a direct but not immediate descendant of the 1881 version. Because no suitable IRI vocabulary yet exists for the relationships between texts, TAN itself has coined an IRI that can be used by anyone wishing to declare that, given two ordered sources, the second descends from the first through an unknown number of intermediaries: tag:textalign.net,2015:bitext-relation:a/x+/b. (The arbitrary symbol / signifies a step from one version to the next, and the x+ represents one or more intermediate versions.) We'll use that one for now.
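Because the suffix of this IRI follows a regular micro-syntax, a program can interpret it mechanically. The helper below is hypothetical (it is not part of TAN) and knows only the two patterns discussed here:

```python
def describe_bitext_relation(iri: str) -> str:
    """Interpret the suffix of a TAN bitext-relation IRI.

    A hypothetical sketch: "/" marks one copying step, and "x+" marks
    one or more unknown intermediary versions."""
    suffix = iri.rsplit(":", 1)[-1]  # e.g. "a/x+/b"
    if suffix == "a/x+/b":
        return "b descends from a via one or more intermediaries"
    if suffix == "a/b":
        return "b descends directly from a"
    return "unrecognized relation: " + suffix


print(describe_bitext_relation("tag:textalign.net,2015:bitext-relation:a/x+/b"))
# b descends from a via one or more intermediaries
```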
We face a similar issue when thinking about text reuse, @reuse-type. Here we are concerned with creative activities such as translation, paraphrase, adaptation, and so forth. We generally consider the 1987 version to be an adaptation of the 1881 version. And there are no stable, well-published IRI vocabularies for text reuse. So we adopt an IRI that is part of TAN's standard vocabulary, tag:textalign.net,2015:reuse-type:adaptation:general.
In the previous two cases, we could have come up with our own vocabulary. But the idea behind the semantic web is to use common, familiar vocabulary whenever possible. That's the same principle that drew us to structure and label the poem in four consecutively numbered lines. We adopt conventions we expect others will likely follow. The built-in TAN vocabulary simply gives us a convenient lingua franca for describing some important but abstract concepts. For other examples of IRIs coined by TAN, see Chapter 10, Official TAN vocabularies.
Generally unfamiliar: Some things or concepts will be known to very few people, perhaps not even to us. If we plan to refer to such a thing or concept often, it is preferable to coin a tag URN, as described above. But in some cases, we might find that a tag URN we minted for some concept or thing was, in hindsight, misleading or poorly constructed, because we had thought about the category only superficially. If we wish to avoid such situations, we can assign a randomly generated IRI called a universally unique identifier (UUID), e.g., urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0. UUID URNs are very useful. The likelihood that a randomly generated UUID will be identical to any existing UUID is astronomically small, making UUIDs reliably unique names for anything (barring someone copying and reusing that UUID URN to name some other object or concept). Numerous free UUID generators can be found online.
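No online generator is strictly necessary: Python's standard uuid module, for example, can mint a random (version 4) UUID and render it as a URN in one line:

```python
import uuid

# Mint a fresh, random (version 4) UUID and render it as a URN.
print(uuid.uuid4().urn)

# The example UUID from the text, parsed and re-rendered the same way:
example = uuid.UUID("3fd9cece-b246-4556-b229-48f22a5ae2e0")
print(example.urn)  # urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0
```

Each run of uuid4() yields a new identifier, which is exactly the point: the name is unique precisely because it carries no meaning.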
To humans, a UUID on its own is meaningless, unmemorable, and rather ugly. But it is a start. We always have the option, later, of supplementing it with other IRIs. It's perfectly fine to assign multiple IRIs to one object or concept. But the reverse is never true. One should never use one IRI to identify more than one object or concept.