<head>

At this point, we have finished four TAN files: two transcriptions (TAN-T), one macro-alignment file (TAN-A), and one micro-alignment file (TAN-A-tok). We have avoided discussing the <head> in each of them until now. Before getting into details, some important concepts need to be covered first.
Unlike <body>, which carries the raw data, <head> contains what is often called metadata. That is, <head> contains data about the data that is in <body>.
Because the TAN format is intended primarily to serve scholars, and because the
format is heavily regulated (that is, there are numerous validation rules that
supplement the standard XML ones), the metadata requirements are stricter than they
are for Word documents, HTML, TEI, or other formats you might know better. Scholars who find our file will want to know certain things about it before they can responsibly use it. For example: What sources have we used? Who produced the data? When? What changes or adjustments have been made? What licenses govern the use of the data? These questions are not difficult, but answering them requires thought, care, and some time.
Some metadata questions apply only to one TAN format. For example, in a TAN-A-tok
file, we ask what relationship holds between the two sources. But that question makes
no sense for a TAN-T file, which is merely a transcription. Other questions apply universally, to every TAN file, no matter what kind of data it carries. The TAN formats have been designed so that <head> handles common metadata consistently across every format. This reduces potential confusion and helps other people using our data find the information they want. More important, what we write in one file can be referenced by another without duplication, which reduces the chance of errors.
Another TAN principle is that each <head> should focus exclusively upon the scope of the data in <body>, and not on other things. For example, in a TAN-T file we are concerned only with the transcription, so our metadata too should be concerned only with the transcription. We should indicate its source, but because our file is not about the source itself, we don't need to describe it further. We are not library catalogers, nor should we be. A TAN-T file is for transcribing, not for curating bibliographical data. Our obligation is merely to point a reader to complete and authoritative information, found elsewhere.
TAN was also designed under the principle that all metadata should be useful to both humans and computers. For our example above, we must describe the work we have chosen (Ring around the Rosie) in a way that is comprehensible not just to the reader but to the computer.
Take for example the 1881 book we have used for our first transcription. For the human reader we can write something like "Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]". But this human-readable string is too complex and syntactically opaque for computers and algorithms. A more computer-friendly identifier is the International Standard Book Number (ISBN), which distinguishes, for example, the 1984 version of Mother Goose illustrated by Kayoko Okumura from the one of the same year illustrated by William Joyce. The ISBNs for the Okumura version, 0671493159, and for Joyce's, 0394865340, can be converted into machine-actionable strings called uniform resource names (URNs), in this case urn:isbn:0-671493159 and urn:isbn:0-394865340. (Our 1881 version was published before the ISBN program was introduced. We will see below another way to name it.)
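To make the conversion concrete, here is a minimal Python sketch that checks an ISBN-10's weighted check digit and wraps an ISBN in the urn:isbn namespace (RFC 3187). The function names are our own, chosen for illustration:

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Check an ISBN-10's check digit: the weighted digit sum must be 0 mod 11.

    The final character may be 'X', which stands for the value 10."""
    digits = isbn.replace("-", "")
    if len(digits) != 10:
        return False
    total = 0
    for position, char in enumerate(digits):
        value = 10 if char in "Xx" else int(char)
        total += value * (10 - position)  # weights run 10, 9, ..., 1
    return total % 11 == 0


def isbn_to_urn(isbn: str) -> str:
    """Wrap an ISBN in the 'isbn' URN namespace (RFC 3187)."""
    return "urn:isbn:" + isbn


# The two ISBNs discussed above:
print(isbn_to_urn("0-671493159"))     # urn:isbn:0-671493159
print(isbn10_is_valid("0671493159"))  # True (Okumura edition)
print(isbn10_is_valid("0394865340"))  # True (Joyce edition)
```

Running the check on both ISBNs confirms that they are well formed before we commit them to our metadata.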
There are different URNs for different things: journals (via ISSNs, urn:issn:...), articles (DOIs, urn:doi:...), movies (ISANs, urn:isan:...), and so forth, which means that anyone can use them to refer unambiguously to a particular kind of thing. URN naming schemes must be registered with the Internet Assigned Numbers Authority (IANA), which ensures permanent, persistent, unique names for various types of things. (See IANA's registry and the section called “$official-urn-namespaces” for a complete list of official URN schemes.)
All URNs are simply names. They don't tell you where an object is. To provide a unique location, we have the perhaps more familiar uniform resource locators (URLs), e.g., http://academia.edu. Like URNs, URLs are also centrally regulated, with individuals or organizations buying the rights to domain names from a central registry (usually through a third-party vendor).
Both URNs and URLs can be thought of as the same type of thing, namely, a uniform resource identifier (URI), sometimes called an internationalized resource identifier (IRI). An IRI is a type of URI that allows any alphabet in Unicode, not just Latin. URIs/IRIs are, in essence, nothing more than the set of all URNs and URLs. These four acronyms are easily confused and conflated, even by veterans. URIs and IRIs are basically the same thing, and they encompass URNs and URLs, a relationship that can be remembered by the last letter in each acronym: URIs/IRIs Incorporate both Locators (URLs) and Names (URNs).
If those acronyms are confusing, don't worry. For our purposes, they are pretty much all the same, and from this point onward we'll stick with the term IRI (unless we really mean a location to find a file, which we'll call a URL).
IRIs are essential to a system frequently called the semantic web or linked (open) data, which relies upon IRIs as the basis for a simple universal data model. The semantic web allows people to make assertions in a way that computers can "understand." If people, working independently, happen to use the same IRIs to describe the same things, then computers can be programmed to make associations between disparate, heterogeneous datasets. For example, if one scholar claims through IRIs that X is the mother of Y, and another claims in a different dataset that Y is the mother of Z, a computer can infer that X is the grandmother of Z, without the two scholars being aware of each other's work. When many scholars begin to use IRIs in their data, the result is a network that allows us or anyone else to discover connections across disciplines and projects, and make inferences that transcend any single project.
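The grandmother inference can be sketched in a few lines of Python. The triples and IRIs below are hypothetical, and a real linked-data system would use a standard model such as RDF, but the principle is the same: assertions made independently connect wherever they share an IRI.

```python
# Each assertion is a (subject, predicate, object) triple; the IRIs are
# made-up example.com names, used here purely for illustration.
triples = {
    ("http://example.com/person/X", "motherOf", "http://example.com/person/Y"),
    ("http://example.com/person/Y", "motherOf", "http://example.com/person/Z"),
}


def infer_grandmothers(facts):
    """If A is the mother of B and B is the mother of C, infer A grandmotherOf C."""
    inferred = set()
    for a, p1, b in facts:
        for b2, p2, c in facts:
            if p1 == p2 == "motherOf" and b == b2:
                inferred.add((a, "grandmotherOf", c))
    return inferred


print(infer_grandmothers(triples))
# {('http://example.com/person/X', 'grandmotherOf', 'http://example.com/person/Z')}
```

The two motherOf assertions could come from different datasets; the join succeeds only because both use the same IRI for Y.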
TAN has been designed to be semantic-web friendly, and so requires almost all data in its <head> to be not just human-readable but also computer-readable, normally as an IRI.
Our first task, then, in writing the <head> sections of our four TAN files is to look for IRI vocabulary that will be familiar to those most likely to use our files. In trying to find suitable IRIs, we will find that the persons, things, and concepts we want to describe range from the highly familiar to the unfamiliar.
Highly familiar: The two books that provide the basis of our transcriptions are catalogued and generally well known. A number of services maintained by librarians publish controlled IRI vocabularies that anyone can use to unambiguously identify a particular version of a book. WorldCat (run by OCLC) and the Library of Congress are good examples.
In our case, we have found Library of Congress IRIs for both editions of Mother Goose: http://lccn.loc.gov/12032709 and http://lccn.loc.gov/87042504. Observe that these two IRIs are also, perhaps confusingly, URLs (locations). If we paste these strings into our Web browser, we retrieve a record that describes the book. This locator does not lead us to the book itself, only to information about the book. Nevertheless, the Library of Congress has decided to make this URL also a name for the book, which means that it does double duty, both as a location for a Web page and as a name for a book. Anyone who owns a domain name can designate a URL as a name for an object, a practice that can easily confuse anyone new to the semantic web, because such URLs in reality name two types of things: an entity and a web resource where one can learn more about that entity. The idea is that hundreds of years from now, when the web page no longer exists, the name will still be valid.
In the TAN system, you can apply as many IRIs to a concept as you like. In fact, it is a good practice to find and add as many IRIs as you think worthwhile, just in case someone can't figure out what you're trying to identify. Just make sure that any IRI you copy unambiguously points to the thing you have in mind.
We now have IRIs for the sources. Let's now find an IRI for the work, Ring around the Rosie. The work is widely known, and even has a Wikipedia entry. That Wikipedia entry is a benefit: the Universities of Leipzig and Mannheim and OpenLink Software have collaborated on a project called DBpedia, which provides a unique IRI for every Wikipedia entry in the major languages. The DBpedia IRI in this case is http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses. Once again, this is both a name and a locator. It names a specific, intangible, abstract work, namely, the nursery rhyme that we've called Ring around the Rosie, no matter what specific version. But if you put that IRI into your browser, you will get back more information about that named object.
Familiar to specialists: We will need to have IRIs for some of the people who edited the file. Here we're not interested in the authors of the books we transcribed. We are interested in identifying the people who helped make the TAN file itself. Most people who write and edit TAN files will not be well-known, public figures. If they are, and if they are famous enough to have a Wikipedia entry, then a DBPedia IRI could be used. Or if some of the contributors are also published authors, there is a good chance that they are listed in the databases of either VIAF or ISNI, both of which publish unique IRIs for authors, editors, and other persons central to the publications held in the world's libraries.
Most contributors to TAN files, however, will not be listed in these databases. In those cases, we can name these participants with an IRI that we "own." We have already done something like this by assigning tag URNs to our four TAN files (the value of @id in the root element).
Our editors can do the same thing. If a student Robin Smith has been helping with proofreading, Robin can take an email address (even one that doesn't work any more) and a date when the email address was in use, and construct a tag URN such as tag:smith.robin@example.com,2012:self. This has a slight drawback, in that we cannot type this string into our browser to find out more about this particular Robin, but it at least allows us to assign a name that will not be confused with another Robin Smith, for example the one identified by ISNI as http://isni.org/isni/0000000043306406. (If we want to go a step further, Robin could mint a URN from a domain name that she owns, and set up a linked data service that offers more information, both human- and computer-readable. But this is not required, and it can be a hassle to set up and maintain.)
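A tag URN has a simple, fixed shape defined by RFC 4151: an authority (a domain name or email address), a date on which the minter controlled that authority, and a free-form specific part. The helper below is a hypothetical sketch of that recipe, not part of any TAN tooling:

```python
import re


def make_tag_urn(authority: str, date: str, specific: str) -> str:
    """Build a tag URN (RFC 4151): tag:<authority>,<date>:<specific>.

    The authority should be a domain name or email address that the
    minter controlled on the given date; the date may be YYYY,
    YYYY-MM, or YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}(-\d{2}){0,2}", date):
        raise ValueError("date must be YYYY, YYYY-MM, or YYYY-MM-DD")
    return f"tag:{authority},{date}:{specific}"


# Robin's tag URN from the example above:
print(make_tag_urn("smith.robin@example.com", "2012", "self"))
# tag:smith.robin@example.com,2012:self
```

Because the authority-plus-date pair belonged to exactly one person at that time, every URN minted this way is unique without any central registry beyond the domain-name system itself.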
Let's take up a more difficult challenge in locating an IRI: describing the @bitext-relation in our TAN-A-tok file. @bitext-relation draws from the discipline of stemmatology, which studies how manuscripts were copied from each other, and tries to place those manuscripts in a chain of transmission, a kind of historical stemma (tree). We have to find an IRI that describes the relationship that we claim holds between two text-bearing objects. Making that claim clear is important, because our perspective on the relationship between the two books affects the decisions we make when we align words, and other scholars using our files will want to know the assumptions we held when we aligned the two texts.
For the sake of illustration, we posit that the version published in the 1987 Mother Goose is a direct but not immediate descendant of the 1881 version. Because no suitable IRI vocabulary yet exists for the relationships between texts, TAN itself has coined an IRI that can be used by anyone wishing to declare that, given two ordered sources, the second descends from the first through an unknown number of intermediaries: tag:textalign.net,2015:bitext-relation:a/x+/b. (The arbitrary symbol / signifies a step from one version to the next, and the x+ represents one or more intermediate versions.) We'll use that one for now.
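Because the suffix of this IRI follows a regular micro-syntax, a program can interpret it mechanically. The helper below is hypothetical (it is not part of TAN) and knows only the two patterns discussed here:

```python
def describe_bitext_relation(iri: str) -> str:
    """Interpret the suffix of a TAN bitext-relation IRI.

    A hypothetical sketch: "/" marks one copying step, and "x+" marks
    one or more unknown intermediary versions."""
    suffix = iri.rsplit(":", 1)[-1]  # e.g. "a/x+/b"
    if suffix == "a/x+/b":
        return "b descends from a via one or more intermediaries"
    if suffix == "a/b":
        return "b descends directly from a"
    return "unrecognized relation: " + suffix


print(describe_bitext_relation("tag:textalign.net,2015:bitext-relation:a/x+/b"))
# b descends from a via one or more intermediaries
```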
We face a similar issue when thinking about text reuse, @reuse-type. Here we are concerned with creative activities such as translation, paraphrase, adaptation, and so forth. We generally consider the 1987 version to be an adaptation of the 1881 version. And there are no stable, well-published IRI vocabularies for text reuse. So we adopt an IRI that is part of TAN's standard vocabulary, tag:textalign.net,2015:reuse-type:adaptation:general.
In the previous two cases, we could have come up with our own vocabulary. But the idea behind the semantic web is to use common, familiar vocabulary whenever possible. That's the same principle that drew us to structure and label the poem in four consecutively numbered lines. We adopt conventions we expect others will likely follow. The built-in TAN vocabulary simply gives us a convenient lingua franca for describing some important but abstract concepts. For other examples of IRIs coined by TAN, see Chapter 10, Official TAN vocabularies.
Generally unfamiliar: Some things or concepts will be known to very few people, perhaps not even to us. If we plan to refer to such a thing or concept often, it is preferable to coin a tag URN, as described above. But in some cases, we might find that a tag URN we minted for some concept or thing was, in hindsight, misleading or poorly constructed, because we had thought about the category only superficially. If we wish to avoid such situations, we can assign a randomly generated IRI called a universally unique identifier (UUID), e.g., urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0. UUID URNs are very useful. The likelihood that a randomly generated UUID will be identical to any existing UUID is astronomically small, making UUIDs reliably unique names for anything (barring someone copying and reusing that UUID URN to name some other object or concept). Numerous free UUID generators can be found online.
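No online generator is strictly necessary: Python's standard uuid module, for example, can mint a random (version 4) UUID and render it as a URN in one line:

```python
import uuid

# Mint a fresh, random (version 4) UUID and render it as a URN.
print(uuid.uuid4().urn)

# The example UUID from the text, parsed and re-rendered the same way:
example = uuid.UUID("3fd9cece-b246-4556-b229-48f22a5ae2e0")
print(example.urn)  # urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0
```

Each run of uuid4() yields a new identifier, which is exactly the point: the name is unique precisely because it carries no meaning.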
To humans, a UUID on its own is meaningless, unmemorable, and rather ugly. But it is a start. We always have the option, later, of supplementing it with other IRIs. It's perfectly fine to assign multiple IRIs to one object or concept. But the reverse is never true. One should never use one IRI to identify more than one object or concept.