<head>
At this point, we have finished four TAN files: two transcriptions, one TAN-A-div
file, and one TAN-A-tok file. So far we have suppressed the <head> in all of them.
Before getting into details, we first need to discuss a few principles that TAN
relies upon.
Unlike <body>, which carries the raw data, <head> contains what is oftentimes
called metadata. That is, <head> contains data that describes the data. Because the TAN
format is intended primarily to serve scholars, and because the format is heavily
regulated (that is, there are numerous validation rules that supplement the basic
ones behind XML), the metadata requirements are stricter than those of other formats.
Scholars who use our data really need to know some essential things before they can
responsibly use the data we produce. For example, what are the sources we have used?
Who produced the data? When? What key assumptions have been made in producing the
data? What rights do other people have to use the data? The questions are not
difficult to answer, but they are critical, and we should take the time we need to
get correct answers.
Some of these questions are specific to certain types of data. For example, in a TAN-A-tok file, we ask what relationship the two sources hold to each other, a question that makes no sense for a TAN-T file. Other questions apply universally across all TAN files, no matter what kind of data. As we go from one TAN format to the next, we need to deal as much as we can with similar structures and expectations. This reduces potential confusion in creating and editing a TAN file, and helps other people using our data to find the information they want. More importantly, what we write in one file might save us some work in another.
The rigorous scholarly requirements for TAN metadata are offset somewhat by
another principle that was adopted in the design of TAN, namely, that each format's
<head> should focus exclusively upon the data in <body>
and not other things. That is to say, in a transcription,
we should definitely indicate what our source is. But we should not try to write a
catalog entry, or even a structured citation, for the book we have used. We are not
library catalogers. Our obligation is merely to point somewhere a reader can get more
complete information. The <head> is designed to help us to stay focused on the
task and data at hand.
TAN was also designed with the assumption that all metadata should be useful to both humans and computers. For our example above, we must describe the work we have chosen in such a way that the phrase Ring around the Rosie is comprehensible not just to the reader but to the computer, using syntax that a computer can be programmed to act upon.
Take for example the 1881 book we have used for our first transcription. For the
human reader we can say simply something like "Kate Greenaway, Mother
Goose, New York, G. Routledge and sons [1881]". But computers need a
more controlled, predictable syntax before they can be directed to the correct
edition of Mother Goose (or rather to a digital surrogate of the
edition). The human-readable string is too complex, and syntactically opaque. A more
computer-friendly identifier is the International Standard Book Number (ISBN),
which distinguishes the 1984 version of Mother Goose illustrated
by Kayoko Okumura from the one of the same year illustrated by William Joyce. The
ISBNs for the Okumura version, 0671493159, and for Joyce's, 0394865340, can be
converted into machine-actionable strings called uniform resource names (URNs), in
this case urn:isbn:0-671493159 and urn:isbn:0-394865340.
(Our 1881 version was published before the ISBN program was introduced. We will see
below other ways to name it.)
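The conversion from ISBN to URN described above can be sketched in a few lines of Python. This is only an illustration, not part of TAN: the function name isbn10_to_urn is hypothetical, and the sketch assumes that the hyphens in an ISBN are optional punctuation, so it emits the unhyphenated form of the URN.

```python
def isbn10_to_urn(isbn: str) -> str:
    """Return a urn:isbn: string for a valid ISBN-10, else raise ValueError."""
    # Hyphens and spaces are treated as optional punctuation and dropped.
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        raise ValueError(f"not a 10-character ISBN: {isbn!r}")
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # 'X' stands for 10 and may only be the check digit
        elif char in "0123456789":
            value = int(char)
        else:
            raise ValueError(f"unexpected character {char!r} in {isbn!r}")
        # ISBN-10 weights run from 10 (first character) down to 1 (last).
        total += (10 - position) * value
    if total % 11 != 0:
        raise ValueError(f"checksum failed for {isbn!r}")
    return "urn:isbn:" + chars


print(isbn10_to_urn("0671493159"))   # Okumura version
print(isbn10_to_urn("0-394865340"))  # Joyce version
```

Checking the checksum before minting the URN guards against a mistyped ISBN silently becoming a name for the wrong book, or for no book at all.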
URNs are families of formalized naming schemes regulated by a central body (Internet Assigned Numbers Authority, IANA) to ensure that people and organizations can legitimately coin and use permanent, persistent, unique names for various types of things. There are URN schemes for journals (via ISSNs), articles (DOIs), and movies (ISANs), which means that anyone can refer to them unambiguously in a manner that is computer-friendly.
All URNs are simply names. They don't tell you where an object is, just what its
name is. To provide a unique location, however, we
have uniform resource locators (URLs), which might be much more familiar from daily
use of the Internet, e.g., http://academia.edu. Like URNs, URLs are also
centrally regulated, with individuals or organizations buying the rights to domain
names from a central registry (usually through a third-party vendor).
Both URNs and URLs can be thought of as the same type of thing, namely, a uniform resource identifier (URI). A closely related term is the internationalized resource identifier (IRI), a generalization of the URI that allows characters from any script in Unicode, not just a restricted Latin set. For our purposes, URIs/IRIs are, in essence, nothing more than the set of all URNs and URLs. These four acronyms can be easily confused, and it is best to disambiguate them by thinking of the last letter in each. URIs/IRIs Incorporate both Locators (URL) and Names (URN).
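As a rough illustration of the distinction, the following Python sketch sorts an identifier into "name" or "locator" purely by its scheme, the part before the first colon. The function name classify_iri and the two short scheme lists are assumptions made for this example only. The test is syntactic: it says nothing about how an identifier is actually used.

```python
import re

# Assumption: small illustrative sets, not complete registries of schemes.
LOCATOR_SCHEMES = {"http", "https", "ftp"}
NAME_SCHEMES = {"urn", "tag"}


def classify_iri(candidate: str) -> str:
    """Classify a string as 'locator', 'name', 'other', or 'not an IRI'."""
    scheme, separator, rest = candidate.partition(":")
    # A scheme starts with a letter, followed by letters, digits, '+', '.', '-'.
    if not separator or not rest or not re.fullmatch(r"[A-Za-z][A-Za-z0-9+.-]*", scheme):
        return "not an IRI"
    scheme = scheme.lower()
    if scheme in LOCATOR_SCHEMES:
        return "locator"
    if scheme in NAME_SCHEMES:
        return "name"
    return "other"


for text in ("http://academia.edu",
             "urn:isbn:0-671493159",
             "Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]"):
    print(classify_iri(text), "<-", text)
```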
IRIs are essential to a system frequently called the semantic web or linked (open) data, an agreed way of writing and processing data that relies upon IRIs and a simple data model to connect them. The semantic web allows independent parties to make assertions about things, and if they happen to use the same IRI vocabulary to describe those things, then we can program computers to make associations between disparate, heterogeneous datasets. This allows us to find connections across disciplines and projects, to marshal computers to make inferences we might not make on our own, and to create a network of linked data.
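A toy sketch in Python may make the idea concrete. Two hypothetical parties each publish a handful of three-part assertions (subject, property, value) about the ISBN URNs mentioned earlier. The property names ex:illustrator and ex:publicationYear are invented for this example and belong to no real vocabulary. Because both parties use the same IRI for the same book, a computer can combine their assertions without any further coordination.

```python
from collections import defaultdict

# Assertions from one hypothetical party.
dataset_a = [
    ("urn:isbn:0-671493159", "ex:illustrator", "Kayoko Okumura"),
    ("urn:isbn:0-394865340", "ex:illustrator", "William Joyce"),
]
# Assertions from a second, independent hypothetical party.
dataset_b = [
    ("urn:isbn:0-671493159", "ex:publicationYear", "1984"),
]


def merge_by_subject(*datasets):
    """Group every (property, value) pair under the IRI it describes."""
    merged = defaultdict(list)
    for dataset in datasets:
        for subject, prop, value in dataset:
            merged[subject].append((prop, value))
    return dict(merged)


merged = merge_by_subject(dataset_a, dataset_b)
for prop, value in merged["urn:isbn:0-671493159"]:
    print(prop, value)
```

Had the second party named the book with a free-text title instead of the shared IRI, the merge would have found nothing to connect.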
TAN has been designed to be linked-data friendly, and so requires that almost all
data in its <head> be representable not just in human-readable form but also in
computer-readable form, as an IRI.
Our first task, then, in writing the <head> sections of our four TAN files is to look for IRI
vocabulary that will be familiar to the community of practice most likely to use our
files. In trying to find suitable IRIs, we will find that the persons, things, and
concepts we want to describe will range from the highly familiar to the
unfamiliar.
Highly familiar: The two books that provide the
basis of our transcriptions are well catalogued and generally known. A number of
services run by librarians offer a controlled IRI vocabulary that anyone can use
to identify uniquely a particular version of a book. WorldCat (run by OCLC) and the Library of Congress are good examples.
In our case, we have found accurate Library of Congress IRIs for both editions of
Mother Goose: http://lccn.loc.gov/12032709 and http://lccn.loc.gov/87042504.
Observe that these two IRIs are also,
perhaps confusingly, URLs. If we paste these strings into our browser, we retrieve a
record that describes the book. This locator does not lead us to the book per se,
only to information about the book. Nevertheless,
the Library of Congress has decided to coin this URL also as an IRI name for the
book. Anyone who owns a domain name can designate a URL as a name for an object. And
that allows them to set up their server to also return information about the object
the IRI names. This subtle ambiguity—that the URL both names an entity and is a
location for a webpage—can sometimes be confusing to those who are new to the
semantic web, because such URLs name in reality two types of things: an entity and a
location to find out more information about that entity.
We now have IRIs for the sources. Next, let's find an IRI to name the work,
Ring around the Rosie. The work is widely known, and even has a Wikipedia
entry. That Wikipedia entry is fortuitous. The Universities of Leipzig and
Mannheim and Openlink Software have collaborated on a project called DBPedia, which is committed to
providing a unique IRI for every Wikipedia entry in the major languages. The DBPedia
IRI for the work we have chosen is
http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses. Once again, this
is both a name and a locator. It names a specific intangible object, namely a nursery
rhyme that we've called Ring around the Rosie, no matter what
specific version. But if you put that name into your browser, you will get back more
information about that named object.
Familiar, but only in small circles: We will need to have names for some of the people who edited the file. Here we're not interested in the authors of our books. We are interested in crediting the people who helped make the TAN file. Most people who contribute to the creation of the data file will not be well-known, public figures. If they are, and if they are famous enough to have a Wikipedia entry, then a DBPedia IRI could be used. Or if some of the contributors are also published authors, there is a good chance that they are listed in the databases of either VIAF or ISNI, both of which publish unique IRIs for persons.
Many contributors to TAN files, however, will not be listed in these general
databases. In these cases, we can assign our own IRI to name these participants. We
have already done something like this by assigning tag URNs to our four
TAN files (the value of @id in the root element). We can do the same for our
editors. If a student, Robin Smith, has
been helping with proofreading, we can take an email address for Robin (even one that
doesn't work any more) and a date when the email address was used and construct a tag
URN such as tag:smith.robin@example.com,2012:self. This has a slight
drawback in that we cannot type this string into our browser to find out more about
Robin, but it at least allows us to assign a name that will not be confused with
the Robin Smith identified by ISNI as
http://isni.org/isni/0000000043306406. (If we want to go a step
further, we could mint a URN from a domain name that we own, and set up a linked data
service that offers more information, human- and computer-readable, about Robin, but
this is not required. And it can be a lot of work to maintain.)
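The recipe for a tag URN (an email address or domain name, a date, and a label of our own choosing) can be sketched as a small Python helper. The function name make_tag_urn is hypothetical, and the checks are deliberately loose: they follow the general shape of the tag scheme (RFC 4151) without attempting full validation.

```python
import re


def make_tag_urn(authority: str, date: str, specific: str) -> str:
    """Build tag:<authority>,<date>:<specific> from its three parts.

    authority: an email address or domain name we control (or controlled)
    date:      a date on which we controlled it: YYYY, YYYY-MM, or YYYY-MM-DD
    specific:  any label we choose to distinguish this name from our others
    """
    if not re.fullmatch(r"\d{4}(-\d{2}(-\d{2})?)?", date):
        raise ValueError(f"date must be YYYY, YYYY-MM, or YYYY-MM-DD: {date!r}")
    # A comma or colon in the authority would make the result ambiguous.
    if not authority or "," in authority or ":" in authority:
        raise ValueError(f"unsuitable authority: {authority!r}")
    return f"tag:{authority},{date}:{specific}"


print(make_tag_urn("smith.robin@example.com", "2012", "self"))
```

The date is what lets a defunct email address remain usable: the name asserts only that the address belonged to its owner at that time.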
Another example of field-specific IRIs is the concept of relationship between two
text-bearing objects. We are assuming for the sake of illustration that the version
published in the 1987 Mother Goose is a direct descendant of the
1881 version. Our assumption is important to declare, because if we had a different
view on how one related to the other, it would probably affect the specifics of our
word-for-word alignments. Because no suitable IRI vocabulary yet exists for such
concepts, TAN has coined an IRI that can be used by anyone wishing to declare that
the second of two sources descends from the first through an unknown number of
intermediaries: tag:textalign.net,2015:bitext-relation:a/x+/b.
We face a similar issue when thinking about text reuse. We generally consider the
1987 version to be an adaptation of the 1881 version. There are no stable,
well-published IRI vocabularies for text reuse, so we adopt a TAN-coined IRI,
tag:textalign.net,2015:reuse-type:adaptation:general.
For other examples of IRIs coined by TAN, see Chapter 9, Official TAN keywords.
Generally unfamiliar: Some things or concepts
will be known to very few people, perhaps only to us. If we plan to refer to that
thing or concept often, it is preferable to coin a tag URN, as described above. But
in some cases, we might find that a tag URN we minted for some concept or thing was,
in hindsight, misleading or poorly constructed, because we hadn't taken into account
other things that should be named. So if we wish to avoid these kinds of situations,
we can assign a random IRI called a universally unique identifier (UUID), e.g.,
urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0. These uuid URNs, which
are generated by computers through randomizing functions, are very useful. The
likelihood that a randomly generated uuid will be identical to any other uuid is
astronomically small, making them reliably unique names for anything (barring
someone copying and reusing that uuid URN to name some other object or concept).
Numerous free UUID generators can be found online.
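An online generator is not strictly necessary. As a minimal sketch, Python's standard uuid module can mint a random uuid URN directly. The wrapper name mint_uuid_urn is our own invention for this example.

```python
import uuid


def mint_uuid_urn() -> str:
    """Return a freshly generated random (version 4) UUID as a urn:uuid: string."""
    return uuid.uuid4().urn


# Every call yields a different name, so the output varies from run to run.
print(mint_uuid_urn())
```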
To humans, a UUID on its own is meaningless, and rather ugly. But it is a good start. We always have the option, later, of adding another, more meaningful IRI. It's perfectly fine to give one object or concept multiple IRIs. But the reverse is never acceptable: one should never use the same IRI to identify more than one object or concept.
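This one-way rule is easy to check mechanically. The following Python sketch, with a hypothetical function name and made-up labels for the things being named, accepts several IRIs for one thing but reports any IRI that has been attached to more than one thing.

```python
def find_reused_iris(assignments):
    """Return {iri: things} for every IRI that names more than one thing.

    assignments: iterable of (iri, thing) pairs, where thing is any label
    we use locally to tell our objects and concepts apart.
    """
    named = {}
    for iri, thing in assignments:
        named.setdefault(iri, set()).add(thing)
    return {iri: things for iri, things in named.items() if len(things) > 1}


assignments = [
    # Two IRIs for one person: acceptable.
    ("tag:smith.robin@example.com,2012:self", "Robin Smith, proofreader"),
    ("urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0", "Robin Smith, proofreader"),
    # The same uuid URN copied to name something else: a violation.
    ("urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0", "our first transcription"),
]

for iri, things in find_reused_iris(assignments).items():
    print(iri, "names", len(things), "different things")
```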