<head>
Now that we have explored various IRI vocabularies for concepts related to our Ring-a-ring-a-roses files, we can complete the metadata in our four TAN files. Let us start with the TAN-T file of the 1881 version:
<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021"
  id="tag:parkj@textalign.net,2015:ring01">
  <head>
    <name>TAN transcription of Ring a Ring o' Roses</name>
    <master-location
      href="http://textalign.net/release/TAN-2020/examples/ring-o-roses.eng.1881.xml"/>
    <license licensor="park">
      <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
      <name>Attribution 4.0 International</name>
    </license>
    <work>
      <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
      <name>"Ring a Ring o' Roses" or "Ring Around the Rosie"</name>
    </work>
    <source>
      <IRI>http://lccn.loc.gov/12032709</IRI>
      <name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]</name>
    </source>
    <vocabulary-key>
      <person xml:id="park">
        <IRI>tag:parkj@textalign.net,2015:self</IRI>
        <name>Jenny Park</name>
      </person>
      <div-type xml:id="line">
        <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI>
        <name>line of poetry</name>
      </div-type>
      <role xml:id="creator">
        <IRI>http://schema.org/creator</IRI>
        <name xml:lang="eng">creator</name>
      </role>
    </vocabulary-key>
    <file-resp who="park"/>
    <resp roles="creator" who="park"/>
    <change when="2014-08-13" who="park">Started file</change>
    <to-do/>
  </head>
  . . . . . . .
</TAN-T>
<name>, the human-readable counterpart to the @id inside the root element, can be anything. We can also supply more than one <name>, in case we wish to provide alternative names of the file, or translations of them.
One or more <master-location>s provide URLs where master versions of the
file are kept (and maintained). We provide this as a courtesy to others who might be
using our data. Anyone who validates their local copy of the file will be warned if
it does not match the master version, and they will be told of the most recent
changes. With a couple of keystrokes, they can update their local copy to match the
master. This one-way communication system lets us silently and conveniently notify
other users of changes. We do not have to keep track of who is using our file, and
users do not have to pester us with questions about what changed when.
<master-location> is mandatory only if we are finished with our to-do list, which is specified in <to-do>. If that element is
empty, then we imply that we know of nothing further that needs to be done to the
file. Conversely, any elements in <to-do> specify what remains to be done, and details will be
returned to other users. That way you can release data that is useful but not
completely perfect, and let users know about its deficiencies. This approach is ideal
for formats such as TAN-A-tok, where you might have released only some of the data,
and you are working on the rest.
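The rule linking <to-do> and <master-location> can be sketched in Python as follows. This is only an illustration, not TAN's own validation code: the function name missing_master_location is hypothetical, and the <comment> child used here to populate <to-do> is an assumption about what that element may contain.

```python
import xml.etree.ElementTree as ET

TAN_NS = "tag:textalign.net,2015:ns"


def q(tag):
    """Qualify a tag name with the TAN namespace."""
    return f"{{{TAN_NS}}}{tag}"


def missing_master_location(head):
    """Return True if <to-do> is empty but no <master-location> is given.

    Hypothetical helper: an empty <to-do> signals a finished file, and a
    finished file is expected to say where its master copy lives.
    """
    todo = head.find(q("to-do"))
    finished = todo is not None and len(todo) == 0  # no child elements
    has_master = head.find(q("master-location")) is not None
    return finished and not has_master


finished_without_master = ET.fromstring(
    f'<head xmlns="{TAN_NS}"><to-do/></head>'
)
# The <comment> child is an assumed way of listing a remaining task.
unfinished = ET.fromstring(
    f'<head xmlns="{TAN_NS}"><to-do>'
    "<comment>Proofread the transcription</comment>"
    "</to-do></head>"
)

print(missing_master_location(finished_without_master))  # True
print(missing_master_location(unfinished))  # False
```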
One day the link in <master-location> will be dead. But perhaps a copy of our file will be in circulation elsewhere. The document @id in the root element provides a way to identify files, independent of links, and perhaps locate them in unexpected places.
<license> specifies the license under which we are releasing our data. This element has nothing to do with the copyright of the source we have used (although, having been published in 1881, the book is clearly in the public domain). That is, we are specifying what rights are attached to the data, not to its source: in other words, what strictures, if any, we have placed on the content in <body>. In this example, we have released the data under a Creative Commons license. The child element <IRI> specifies a Creative Commons IRI, and <name> is the human-readable form. @licensor specifies who has granted the license, in this case our fictive Jenny Park (see below).
The conjunction of <IRI> and <name>, the IRI + name pattern, recurs throughout TAN files. It is used to provide identifiers for vocabulary items. In an element that takes the IRI + name pattern, we may include as many child <IRI>s or <name>s as we like. But if we do so, we are stating that they are synonymous, i.e., that they all name the same thing. (Once again, an IRI is unique, so it should never be used to identify more than one thing.)
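To make the synonymy rule concrete, here is a minimal Python sketch. It is not part of TAN itself: the helper names iri_name_pattern and same_thing are made up for this illustration, and the second element simply adds a second, synonymous IRI to the same license. Because an IRI identifies exactly one thing, two elements that share any IRI can be treated as naming the same thing.

```python
import xml.etree.ElementTree as ET

TAN_NS = "tag:textalign.net,2015:ns"

# The <license> of the 1881 transcription.
LICENSE_A = f"""
<license xmlns="{TAN_NS}">
  <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
  <name>Attribution 4.0 International</name>
</license>
"""

# The same license described with two synonymous IRIs and another name.
LICENSE_B = f"""
<license xmlns="{TAN_NS}">
  <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
  <IRI>tag:textalign.net,2015:license:by/4.0/</IRI>
  <name>by 4.0</name>
</license>
"""


def iri_name_pattern(element):
    """Return (iris, names) for an element that follows the IRI + name pattern."""
    iris = [e.text.strip() for e in element.findall(f"{{{TAN_NS}}}IRI")]
    names = [e.text.strip() for e in element.findall(f"{{{TAN_NS}}}name")]
    return iris, names


def same_thing(a, b):
    """Illustrative rule: two elements name the same thing if they share an IRI."""
    return bool(set(iri_name_pattern(a)[0]) & set(iri_name_pattern(b)[0]))


a = ET.fromstring(LICENSE_A)
b = ET.fromstring(LICENSE_B)
print(same_thing(a, b))  # True
```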
<work> uses the IRI + name pattern to name the work we have chosen to transcribe. <source> points, through its IRI + name pattern, to a computer- and human-readable description of the book we have chosen.
<vocabulary-key> contains vocabulary that we are using in our file. Inside, we can place more vocabulary items, and attach locally unique ids. For example, an IRI + name pattern is used for <person>, which identifies Jenny Park through a tag URI. The value of @xml:id allows us to use "park" any time we want to mention Jenny. In fact, we already have, at @licensor. Any mention of "park" will point to the appropriate item in <vocabulary-key>.
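The way a value such as "park" points back into <vocabulary-key> can be sketched in Python. This is an illustration under simplifying assumptions, not TAN's own resolution code: resolve_idref is a hypothetical helper, and the <head> below is abridged from the example above.

```python
import xml.etree.ElementTree as ET

TAN_NS = "tag:textalign.net,2015:ns"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # expanded form of xml:id

HEAD_XML = f"""
<head xmlns="{TAN_NS}">
  <license licensor="park">
    <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
    <name>Attribution 4.0 International</name>
  </license>
  <vocabulary-key>
    <person xml:id="park">
      <IRI>tag:parkj@textalign.net,2015:self</IRI>
      <name>Jenny Park</name>
    </person>
  </vocabulary-key>
</head>
"""


def resolve_idref(head, idref):
    """Return the <vocabulary-key> item whose @xml:id equals idref, or None."""
    key = head.find(f"{{{TAN_NS}}}vocabulary-key")
    for item in key:
        if item.get(XML_ID) == idref:
            return item
    return None


head = ET.fromstring(HEAD_XML)
licensor_id = head.find(f"{{{TAN_NS}}}license").get("licensor")
licensor = resolve_idref(head, licensor_id)
print(licensor.find(f"{{{TAN_NS}}}name").text)  # Jenny Park
```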
There are a few other parts of <vocabulary-key>. <div-type> specifies an IRI + name pattern for line
divisions, and the value of @xml:id means that we can use "line" any time we want to
invoke the concept. Similarly, we have a <role>. The <IRI> value of <role>
comes from the vocabulary of schema.org, which is maintained by Bing,
Google, and Yahoo! in conjunction with the W3C (the nonprofit organization dedicated
to universal Internet standards), but we could have used Dublin Core or some other
IRI vocabulary describing behaviors, responsibilities, and roles.
After the <vocabulary-key>, we get into parts of the file that specify who did what, when. First is a <file-resp>, whose value of @who, "park", indicates that Jenny Park is the one primarily responsible for the file. <resp> specifies further who was responsible for doing what.[6]
Remember that <head> is focused on the data, not its sources, so the claim that Jenny Park is the creator pertains only to the data. No inference should be made about who was responsible for the printed source. If someone wants to know anything about the book, they should pursue the IRI we have provided under <source>.
<change> has attributes @when and @who to specify when the change was made and by whom. The value of @when is always a date or a date + time, formatted according to the ISO 8601 syntax: [YYYY]-[MM]-[DD] or [YYYY]-[MM]-[DD]T[HH]:[MM]:[SS]. @who always carries an IDref that points to a person or organization. <change> does not take the IRI + name pattern, or even any children at all. It simply takes a plain-text description of what changed.
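As a rough illustration of the two @when forms described above, the following Python sketch accepts only a date or a date + time in exactly those shapes. The function name parse_when is made up for this example, and whether TAN accepts further variants (time zones, fractional seconds) is not covered here.

```python
import re
from datetime import datetime

# Exactly the two shapes named in the text: a date, or a date + "T" + time.
WHEN_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2})?")


def parse_when(value):
    """Parse a @when value, raising ValueError if it is malformed."""
    if not WHEN_PATTERN.fullmatch(value):
        raise ValueError(
            f"expected YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS, got {value!r}"
        )
    # fromisoformat also rejects impossible values such as month 13.
    return datetime.fromisoformat(value)


print(parse_when("2014-08-13"))  # 2014-08-13 00:00:00
print(parse_when("2014-10-24T09:30:00"))  # 2014-10-24 09:30:00
```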
So now we have finished one transcription file's metadata. You may have found it to represent a lot of typing: many names, IRIs, and so forth. Is there any way to lighten that load? Yes, there is. TAN is a vocabulary-based format. That is, standard vocabulary items come with the TAN format, and you can design your own vocabulary, which reduces the work involved and keeps your files DRY (don't repeat yourself).
Our second example will look similar to the first one, but notice some shortcuts:
<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021"
  id="tag:parkj@textalign.net,2015:ring02">
  <head>
    <name>TAN transcription of Ring around the Rosie</name>
    <master-location>ring-o-roses.eng.1987.xml</master-location>
    <license which="by 4.0" licensor="park"/>
    <work>
      <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
      <name>Ring around the Rosie</name>
    </work>
    <source>
      <IRI>http://lccn.loc.gov/87042504</IRI>
      <name>Mother Goose, from nursery to literature / by Gloria T. Delama, 1987.</name>
    </source>
    <adjustments>
      <normalization which="no hyphens"/>
    </adjustments>
    <vocabulary-key>
      <div-type xml:id="l" which="line (verse)"/>
      <person xml:id="park" roles="creator">
        <IRI>tag:parkj@textalign.net,2015:self</IRI>
        <name xml:lang="eng">Jenny Park</name>
      </person>
    </vocabulary-key>
    <resp roles="creator" who="park"/>
    <change when="2014-10-24" who="park">Started file</change>
    <comment when="2014-10-24" who="park">See p. 39 of source.</comment>
    <to-do/>
  </head>
  . . . . . .
</TAN-T>
In this example, <name>, <master-location>, and <source> have been modified to describe this file. Note that we haven't had to change <work>.
<license> looks different, but it is equivalent to the one in our previous example, because the IRI + name pattern has been replaced with @which. You may replace any IRI + name pattern with @which; its value must match a <name> in customized or standard vocabulary (a TAN-voc file). In this case, "by 4.0" points to TAN's standard vocabulary for licenses (see the section called “TAN keywords for types of rights (<license>)”). Here is what that looks like under the hood:
<TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2021"
  id="tag:textalign.net,2015:tan-voc:licenses">
  . . . . . . .
  <body affects-element="license">
    <item>
      <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
      <IRI>tag:textalign.net,2015:license:by/4.0/</IRI>
      <name>by 4.0</name>
      <desc>attribution 4.0 international</desc>
    </item>
    . . . . . . .
  </body>
</TAN-voc>
Because the validation rules for TAN-voc files require every <name> to be unique, that element can be treated as a unique identifier, similar to @xml:id. We could have repeated the <license> from the previous TAN-T file. But the @which method is much quicker, cleaner, and DRY.
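The lookup behind @which can be sketched in Python. This is a simplified illustration, not TAN's actual resolution logic: resolve_which is a hypothetical helper, the vocabulary file is abridged to the one item shown above, names are compared by exact string match, and treating @affects-element as a space-separated list is an assumption.

```python
import xml.etree.ElementTree as ET

TAN_NS = "tag:textalign.net,2015:ns"

VOC_XML = f"""
<TAN-voc xmlns="{TAN_NS}" TAN-version="2021"
  id="tag:textalign.net,2015:tan-voc:licenses">
  <body affects-element="license">
    <item>
      <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
      <IRI>tag:textalign.net,2015:license:by/4.0/</IRI>
      <name>by 4.0</name>
      <desc>attribution 4.0 international</desc>
    </item>
  </body>
</TAN-voc>
"""


def resolve_which(voc_root, element_name, which):
    """Return the IRIs of the item whose <name> equals which, or None."""
    for body in voc_root.iter(f"{{{TAN_NS}}}body"):
        # Only consult vocabulary that applies to the element in question.
        if element_name not in body.get("affects-element", "").split():
            continue
        for item in body.findall(f"{{{TAN_NS}}}item"):
            names = [n.text for n in item.findall(f"{{{TAN_NS}}}name")]
            if which in names:
                return [i.text for i in item.findall(f"{{{TAN_NS}}}IRI")]
    return None


voc = ET.fromstring(VOC_XML)
for iri in resolve_which(voc, "license", "by 4.0"):
    print(iri)
```

Resolved this way, <license which="by 4.0"/> yields the same Creative Commons IRI that the first file spelled out by hand.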
Before <vocabulary-key> comes a new element, <adjustments>, which contains a <normalization> statement whose @which says "no hyphens". That too points to a standard TAN vocabulary for normalizations: an IRI + name pattern for eliminating discretionary hyphens (see the section called “TAN keywords for types of normalizations (<normalization>)”). Here's what that vocabulary item looks like (invisible to you, but you can look at it any time you like in the vocabularies subdirectory of the TAN files):
<TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2021"
  id="tag:textalign.net,2015:tan-voc:normalizations">
  . . . . . . .
  <body affects-element="normalization">
    <item>
      <IRI>tag:textalign.net,2015:normalization:hyphens-discretionary-removed</IRI>
      <name>no hyphens</name>
      <desc>Discretionary word-break line-end hyphens have been deleted.</desc>
    </item>
    . . . . . . .
  </body>
</TAN-voc>
As you might have inferred, the element <normalization> specifies how we have changed the data, namely, that we have opted to remove word-break line-end hyphenation. In other transcriptions we could use <normalization> to declare other kinds of changes we felt compelled to make, such as removing editorial comments or footnote signals. A healthy list of <normalization>s is a courtesy to users of our data, some of whom might passionately care about keeping or removing line-end hyphenation.
Back to our example. <div-type> has a new value for @xml:id, the letter "l", and in it too the IRI + name pattern has been replaced by @which, whose value, "line (verse)", is a standard vocabulary item (see the section called “TAN keywords for types of divisions (<div-type>)”).[7]
There is also a new <comment> element, which is built much the same as <change>. (A <change>, after all, is just a comment about what has been changed.)
That seems to be all there is. But if you've been attentive, you will have noticed that <role> from our first TAN-T file (inside <vocabulary-key>) is missing. That's because we don't need it, based on the same principle that lets us resolve @which. A vocabulary <name> can be invoked not only in @which, but in any attribute that points to values of @xml:id, in this case @roles. There is already a standard TAN vocabulary item with the <name> "creator", so we can use it directly without having to declare an intermediate vocabulary item with an @xml:id. If we had defined something else in <vocabulary-key> with an @xml:id of "creator", that item would take precedence and override the built-in TAN vocabulary. But we haven't, so the standard TAN vocabularies are the default.
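The precedence rule can be summarized in a few lines of Python. This is a sketch of the rule as described here, not TAN's implementation: resolve_reference and both dictionaries are hypothetical stand-ins, and the standard item is represented by a placeholder string rather than its real IRIs.

```python
def resolve_reference(value, local_ids, standard_names):
    """Resolve an attribute value such as roles="creator".

    local_ids maps @xml:id values declared in <vocabulary-key> to items;
    standard_names maps <name> values in standard vocabulary to items.
    A local @xml:id takes precedence over a standard <name>.
    """
    if value in local_ids:
        return "local", local_ids[value]
    if value in standard_names:
        return "standard", standard_names[value]
    raise KeyError(f"no vocabulary item for {value!r}")


# Placeholder description; the real standard item carries its own IRIs.
standard_names = {"creator": "standard TAN vocabulary item named 'creator'"}

# Second file: nothing local is called "creator", so the standard item applies.
print(resolve_reference("creator", {}, standard_names)[0])  # standard

# First file: <role xml:id="creator"> is declared locally and takes precedence.
local_ids = {"creator": "http://schema.org/creator"}
print(resolve_reference("creator", local_ids, standard_names)[0])  # local
```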
[6] If you decide to modify someone else's TAN file, you should credit / blame yourself for the changes. Your first order of business should be to add a <person> to the <vocabulary-key>, identifying yourself. You can then either add a <change> (see below) or a <resp> (you might need to specify a <role> in the <vocabulary-key>). You should not change the document's @id, unless your changes are so significant that it becomes altogether a new document, your document. TAN does not try to broker the age-old problem of determining the point at which a thing becomes something altogether different (e.g., the Ship of Theseus problem). Use your best intuition.
[7] A line of poetry is to be contrasted with a physical line on the page. Some
lines of poetry take up two or more physical lines. For the physical line you
would specify: which="line (physical)".