<head>
Now that we have explored various IRI vocabularies for the concepts around our versions of Ring-a-ring-a-roses, we can complete the metadata in our four TAN files. Let us start with the TAN-T file of the 1881 version:
<head>
    <name>TAN transcription of Ring a Ring o' Roses</name>
    <master-location>ring-o-roses.eng.1881.xml</master-location>
    <rights-excluding-sources rights-holder="park">
        <IRI>http://creativecommons.org/licenses/by/4.0/deed.en_US</IRI>
        <name>This data file is licensed under a Creative Commons Attribution 4.0 International License. The license is granted independent of any rights and licenses that may be associated with the source.</name>
    </rights-excluding-sources>
    <source>
        <IRI>http://lccn.loc.gov/12032709</IRI>
        <name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]</name>
    </source>
    <declarations>
        <work>
            <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
            <name>"Ring a Ring o' Roses" or "Ring Around the Rosie"</name>
        </work>
        <div-type xml:id="line">
            <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI>
            <name>line of poetry</name>
        </div-type>
    </declarations>
    <agent xml:id="park" roles="creator">
        <IRI>tag:parkj@textalign.net,2015:self</IRI>
        <name>Jenny Park</name>
    </agent>
    <role xml:id="creator">
        <IRI>http://schema.org/creator</IRI>
        <name xml:lang="eng">creator</name>
    </role>
    <change when="2014-08-13" who="park">Started file</change>
</head>
<name> is the human-readable form of the @id in the root element, <TAN-T>. It can be anything, and we can supply more than one <name> if we wish to provide it in different languages or variations.
<master-location> is mandatory only if we have claimed through @in-progress that the file is no longer in progress. One or more of these elements provide URLs where master versions of the file are kept (and updated). Each may be an absolute URL, such as an address on the Internet, or a relative URL, in case we are working exclusively on our local computer. We provide this as a courtesy to others who might be using our data. If someone downloads a copy and starts working with it, then whenever they validate the file, a warning is returned if it does not match the master version, along with a message or the location of the elements that were last changed. This allows users to find out whether changes have been made, and it allows us to make corrections and silently notify other users of our alterations, without having to keep track of who is using the file.
<rights-excluding-sources> contains information about rights to the data we are releasing. This element has nothing to do with the copyright of the source we have used (although, having been published in 1881, the book is clearly in the public domain). This once again reflects the TAN metadata principle of describing our data and not other things. We have the option to describe the license of the source we have used (see the rest of the guidelines for guidance), but we absolutely must declare whether we have placed additional strictures on the dataset we have created. That is, we are declaring the rights attached to the data, not its source. In this example, we have released the data under a Creative Commons license. The child element <IRI> specifies the IRI assigned by Creative Commons, and <name> describes it in human-readable format.
The conjunction of <IRI>
and
<name>
, the IRI +
name pattern, is a recurrent feature of TAN files. We may include any
number of <IRI>
or <name>
elements in an IRI + name
pattern. But if we do so, we are stating that they all name the same thing, not
different things.
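The IRI + name pattern can be read mechanically. The following is a minimal Python sketch, not part of TAN itself, that collects every <IRI> and <name> from the <source> element of the 1881 example. The function name read_iri_name is hypothetical, and the sketch assumes the elements carry no XML namespace, as in the listings printed here.

```python
import xml.etree.ElementTree as ET

# The <source> element from the 1881 example, copied from the listing above.
SOURCE_XML = """
<source>
  <IRI>http://lccn.loc.gov/12032709</IRI>
  <name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]</name>
</source>
"""


def read_iri_name(element):
    """Return (iris, names) for an element that uses the IRI + name pattern.

    Hypothetical helper: every IRI and every name found is taken to
    identify one and the same thing, as the pattern requires.
    """
    iris = [e.text.strip() for e in element.findall("IRI")]
    names = [e.text.strip() for e in element.findall("name")]
    return iris, names


source = ET.fromstring(SOURCE_XML)
iris, names = read_iri_name(source)
print(iris)   # machine-readable identifiers
print(names)  # human-readable labels
```

Because both lists describe a single thing, a processor can match on any of the IRIs and display any of the names.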
<source>
points, through its
IRI + name pattern, to a computer- and human-readable description of the book we have
chosen.
<declarations>
contains data that is specific to TAN file types, to declare the assumptions we have
made relevant to the kind of data we have created. In this case, because we are
working with transcriptions, we have two major components: <work>
and <div-type>
.
<work>
uses the IRI + name
pattern to name the work we have chosen to transcribe. <div-type>
specifies the type of
divisions we have chosen to use to segment the transcription. In a more complex text,
there would be several <div-type>
s. Each one has an @xml:id
, which takes as a value some
nickname that we wish to use for @type
values of <div>
s.
The IRI + name pattern is also used for <agent>
, which describes who was involved in creating the
data, and <role>
. We may have as
many <agent>
s and <role>
s as we wish. The
agent
in this case, Jenny Park, has been given a tag URI. The
<IRI>
value of <role>
comes from the vocabulary of
schema.org, which is maintained by
Bing, Google, and Yahoo! in conjunction with the W3C (the nonprofit organization
dedicated to universal Internet standards), but we could have used Dublin Core or
some other IRI vocabulary describing behaviors, responsibilities, and roles.
Note: If you decide to modify someone else's TAN file, then you become responsible
for changes, not the original person or organization. Your first point of order
should be add an
Remember that <head>
is
focused on the data, not its sources, so the claim that Jenny Park is the creator
pertains only to the data. No inference should be made about who created the source.
If someone wants that information, or anything else about the source, they should
pursue the identifier we have provided under <source>
.
<change> has attributes @when and @who that specify when the change or comment was made and by whom. The value of @when is always a date, optionally followed by a time, in the standard form YYYY-MM-DD. @who always carries a value that refers to an agent/@xml:id. Neither <change> nor <comment> (missing here) takes an IRI, mainly because the likelihood that the data would ever be reused, repeated, or linked to is too remote to make a mandated <IRI> useful.
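The rule that @who must point to an agent/@xml:id is easy to check by machine. The sketch below is an illustration in Python, not the TAN validator; the function name unresolved_who is hypothetical, and splitting @who on whitespace is an assumption that is harmless for single values.

```python
import xml.etree.ElementTree as ET

# ElementTree reports the xml:id attribute under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Abridged <head> of the 1881 example.
HEAD_XML = """
<head>
  <agent xml:id="park" roles="creator">
    <IRI>tag:parkj@textalign.net,2015:self</IRI>
    <name>Jenny Park</name>
  </agent>
  <role xml:id="creator">
    <IRI>http://schema.org/creator</IRI>
    <name xml:lang="eng">creator</name>
  </role>
  <change when="2014-08-13" who="park">Started file</change>
</head>
"""


def unresolved_who(head):
    """List @who values on <change>/<comment> that match no agent/@xml:id."""
    agent_ids = {agent.get(XML_ID) for agent in head.findall("agent")}
    missing = []
    for elem in head.iter():
        if elem.tag in ("change", "comment"):
            for ref in elem.get("who", "").split():
                if ref not in agent_ids:
                    missing.append(ref)
    return missing


print(unresolved_who(ET.fromstring(HEAD_XML)))  # [] when every reference resolves
```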
So now we have finished one transcription file's metadata. The other one will look similar, but we'll also take a couple of nice shortcuts:
<head>
    <name>TAN transcription of Ring around the Rosie</name>
    <master-location>ring-o-roses.eng.1987.xml</master-location>
    <rights-excluding-sources which="by-nc-nd_2.0" rights-holder="park"/>
    <source>
        <IRI>http://lccn.loc.gov/87042504</IRI>
        <name>Mother Goose, from nursery to literature / by Gloria T. Delama, 1987.</name>
    </source>
    <declarations>
        <work>
            <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
            <name>Ring around the Rosie</name>
        </work>
        <div-type xml:id="l" which="half-line (verse)"/>
        <filter>
            <normalization which="no hyphens"/>
        </filter>
    </declarations>
    <agent xml:id="park" roles="creator">
        <IRI>tag:parkj@textalign.net,2015:self</IRI>
        <name xml:lang="eng">Jenny Park</name>
    </agent>
    <role xml:id="creator" which="creator"/>
    <change when="2014-10-24" who="park">Started file</change>
    <comment when="2014-10-24" who="park">See p. 39 of source.</comment>
</head>
One significant difference is that three of the elements that normally take the IRI + name pattern have been replaced with a simpler form that takes merely @which and @xml:id. That is because TAN has a predefined vocabulary that can be invoked by calling it (through @which) and giving it an abbreviation to be used elsewhere in the document (@xml:id).
<declarations>
has a new
child, <filter>
, which contains
a <normalization>
statement that declares, through the name and the IRI in the underlying TAN
definition, that we have opted to remove word-break line-end hyphenation. This
provides a cautionary note to users of our data who might value line-end hyphenation.
Any number of <normalization>
s can be used to describe any alterations we
might have made in our transcription. In other transcriptions we could use this
feature to declare other suppressions, such as editorial comments or footnote
signals.
Note that the value of div-type/@xml:id
here, the letter l
, differs from our
previous transcription file, line
. Even though we have adopted a
different nickname, they are treated as equivalent because in each file we have
defined l
or line
with the same IRI,
http://dbpedia.org/resource/Line_(poetry)
. A computer that later
looks for files with lines of poetry will not care about l
and
line
, but will look at the underlying IRI that defines these terms.
This exemplifies how linked data (see above) can support our work. We are free to use
abbreviations and terms that make sense to us, yet we can also tie those
abbreviations into the larger infrastructure by means of IRIs. It also means that we
can tether our texts to others on the basis of segments that may be rare and
unfamiliar in general, or common only within a specific field (e.g., sections of a
legal document).
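To illustrate how a processor might treat the two nicknames as equivalent, here is a small Python sketch that compares div-types by IRI rather than by @xml:id. It assumes, as the text states, that the definition behind l resolves to the same IRI as line. The second XML fragment is therefore a hypothetical expanded form, and the function names are illustrative only.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# <declarations> excerpt from the 1881 file.
FILE_1881 = """
<declarations>
  <div-type xml:id="line">
    <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI>
    <name>line of poetry</name>
  </div-type>
</declarations>
"""

# Hypothetical expanded form of the div-type nicknamed "l" (assumption:
# its underlying definition carries the IRI named in the text).
FILE_1987 = """
<declarations>
  <div-type xml:id="l">
    <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI>
    <name>line of poetry</name>
  </div-type>
</declarations>
"""


def div_type_iris(declarations_xml):
    """Map each div-type nickname to the set of IRIs that define it."""
    root = ET.fromstring(declarations_xml)
    return {
        dt.get(XML_ID): {iri.text.strip() for iri in dt.findall("IRI")}
        for dt in root.findall("div-type")
    }


def same_div_type(iris_a, iris_b):
    """Two div-types are equivalent if they share at least one IRI."""
    return bool(iris_a & iris_b)


types_1881 = div_type_iris(FILE_1881)
types_1987 = div_type_iris(FILE_1987)
print(same_div_type(types_1881["line"], types_1987["l"]))  # True
```

The nicknames never enter the comparison; only the IRIs do.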
Now that we have created the metadata for our transcriptions, we turn to the
alignment files. Those <head>
s
will look slightly different. We start with the TAN-A-div
file:
<head>
    <name>div-based alignment of multiple versions of Ring o Roses</name>
    <master-location>ringoroses.div.1.xml</master-location>
    <rights-excluding-sources which="by-nc-nd_4.0" rights-holder="park"/>
    <source xml:id="eng-uk">
        <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
        <name>Transcription of ring around the roses in English (UK)</name>
        <location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml</location>
    </source>
    <source xml:id="eng-us">
        <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
        <name>Transcription of ring around the roses in English (US)</name>
        <location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml</location>
    </source>
    <declarations/>
    <agent xml:id="park" roles="creator">
        <IRI>tag:parkj@textalign.net,2015:self</IRI>
        <name xml:lang="eng">Jenny Park</name>
    </agent>
    <role xml:id="creator" which="creator"/>
    <change when="2014-08-14" who="park">Started file</change>
</head>
Much of the code above will look similar to the previous two examples. Every
alignment file has only one kind of source, namely TAN transcription files, nothing
else. Therefore <source>
's
<IRI>
always takes the
@id
value of the corresponding
TAN transcription file. <name>
is
arbitrary. It may replicate exactly the title found in the transcription file, or it
may be modified, perhaps to harmonize better with the descriptions of the other texts
aligned in the file. <source> also has a child element not seen in the earlier two examples, <location>, which specifies where the digital file was accessed and when (through @when-accessed). We may include as many of these <location> elements as we wish, with the most preferred or reliable location at the top, since the validation process will use the first document that is available. The @when-accessed value is important, because the validator will look for changes in the file, and if there have been changes since we last accessed the file, it will return a warning with a summary of the number and kind of changes. If such a report is returned, it is up to us to determine whether the alterations merit any action on our part.
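The two behaviors just described, taking the first available <location> and looking for changes made after @when-accessed, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the TAN validator: availability is decided by a caller-supplied function, only the date part of @when is compared, and the function names are hypothetical. The data come from the eng-us source above and from the <head> of the 1987 transcription.

```python
import xml.etree.ElementTree as ET
from datetime import date

# The eng-us <source> from the TAN-A-div example.
SOURCE_XML = """
<source xml:id="eng-us">
  <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
  <name>Transcription of ring around the roses in English (US)</name>
  <location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml</location>
</source>
"""

# Change log of the 1987 transcription, abridged from its <head>.
TRANSCRIPTION_HEAD_XML = """
<head>
  <change when="2014-10-24" who="park">Started file</change>
  <comment when="2014-10-24" who="park">See p. 39 of source.</comment>
</head>
"""


def first_available(source, is_available):
    """Return the first <location> whose address passes is_available, else None."""
    for location in source.findall("location"):
        if is_available(location.text.strip()):
            return location
    return None


def changes_since(head, when_accessed):
    """Return <change> elements dated after when_accessed (date part only)."""
    accessed = date.fromisoformat(when_accessed[:10])
    return [
        change
        for change in head.findall("change")
        if date.fromisoformat(change.get("when")[:10]) > accessed
    ]


source = ET.fromstring(SOURCE_XML)
head = ET.fromstring(TRANSCRIPTION_HEAD_XML)
location = first_available(source, lambda address: True)  # pretend every address is reachable
newer = changes_since(head, location.get("when-accessed"))
print(len(newer))  # 1
```

With these dates the sketch reports one change after the access date, which is the situation in which a warning would be expected.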
Our TAN-A-div file could have any number of <source>
s, and not necessarily for the same work. It also
does not matter in which order we put the <source>
s. <declarations>
is empty, mainly because we have, in this
case, no working assumptions to declare. In more advanced uses, this element would
not be empty.
This <head>
explains why the
<body>
of our TAN-A-div
file is allowed to be empty. We have already specified which sources are to be
aligned and where they are to be found. All TAN-A-div files assume, by default, that
every source that is a version of the same work should be aligned upon the basis of
the @n
value of <div>
s. That is, any user or processor
of a TAN-A-div file may assume that all implicit alignments should be made unless
otherwise specified.
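The default alignment can be pictured as grouping <div>s by their @n values across sources. The Python sketch below is only an illustration: the <body> fragments, their placeholder text, and the function name align_by_n are hypothetical, and only top-level <div>s are considered.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified bodies keyed by the source nicknames used above.
# The text content is placeholder text, not the actual transcriptions.
BODIES = {
    "eng-uk": (
        '<body><div type="line" n="1">UK line 1</div>'
        '<div type="line" n="2">UK line 2</div></body>'
    ),
    "eng-us": (
        '<body><div type="l" n="1">US line 1</div>'
        '<div type="l" n="2">US line 2</div>'
        '<div type="l" n="3">US line 3</div></body>'
    ),
}


def align_by_n(bodies):
    """Group the text of top-level <div>s from every source by their @n value."""
    aligned = defaultdict(dict)
    for source_id, body_xml in bodies.items():
        for div in ET.fromstring(body_xml).findall("div"):
            aligned[div.get("n")][source_id] = div.text
    return dict(aligned)


alignment = align_by_n(BODIES)
for n, versions in alignment.items():
    print(n, versions)
```

A division present in only one source, such as n="3" here, simply has no counterpart in the other.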
For transcriptions that are already similarly structured and labeled, a TAN-A-div
file is unnecessary for alignment. But we will see that the options available in a
TAN-A-div's <declarations> and <body> will allow us not only to deal with
inconsistencies in source transcriptions but also to make important statements, such
as indicating where one work quotes from another.
Meanwhile we turn to our fourth file, TAN-A-tok, whose <head>
looks like
this:
<head>
    <name>token-based alignment of two versions of Ring o Roses</name>
    <master-location>ringoroses.01+02.token.1.xml</master-location>
    <rights-excluding-sources which="by-nc-nd_4.0" rights-holder="park"/>
    <source xml:id="ring1881">
        <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
        <name>Ring o roses 1881</name>
        <location when-accessed="2015-01-17">../TAN-T/ring-o-roses.eng.1881.xml</location>
    </source>
    <source xml:id="ring1987">
        <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
        <name>Ring o roses 1987</name>
        <location when-accessed="2015-01-17">../TAN-T/ring-o-roses.eng.1987.xml</location>
    </source>
    <declarations>
        <bitext-relation xml:id="B-descends-from-A">
            <IRI>tag:textalign.net,2015:bitext-relation:a/x+/b</IRI>
            <name>B descends directly from A, unknown number of intermediaries</name>
            <desc>The 1987 version is hypothesized to descend somehow from the 1881 version, mainly for the sake of illustration.</desc>
        </bitext-relation>
        <reuse-type xml:id="adaptationGeneral">
            <IRI>tag:textalign.net,2015:reuse-type:adaptation:general</IRI>
            <name>general adaptation</name>
        </reuse-type>
        <token-definition src="ring1881 ring1987" which="letters"/>
    </declarations>
    <agent xml:id="park" roles="creator">
        <IRI>tag:parkj@textalign.net,2015:self</IRI>
        <name xml:lang="eng">Jenny Park</name>
    </agent>
    <role xml:id="creator" which="creator"/>
    <change when="2015-01-20" who="park">Started file</change>
</head>
The TAN-A-tok <head>
looks
similar to the previous examples, except that <declarations>
has three
children.
<bitext-relation> states through an IRI + name pattern the stemmatic relationship we think holds between the two sources. (Stemmatics is the study of the chain of transmission by which a single work eventually became the multiple copies, versions, and editions that are extant; it frequently involves the creation of genealogy-like trees to illustrate the work's version history.) We have used the entire IRI + name pattern, but we could have replaced it with @which and the value a/x+/b.
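The relationship between the short @which form and the full IRI + name pattern can be sketched as a lookup. The Python below is an illustration only: the one-entry table stands in for TAN's predefined vocabulary and holds nothing but the IRI and name from the example above, while the function name expand_which and the idea of keying the table on element name plus @which value are assumptions.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Stand-in for TAN's predefined vocabulary (assumption), filled only with
# the IRI and name that appear in the <bitext-relation> example.
VOCABULARY = {
    ("bitext-relation", "a/x+/b"): (
        "tag:textalign.net,2015:bitext-relation:a/x+/b",
        "B descends directly from A, unknown number of intermediaries",
    ),
}


def expand_which(element):
    """Return a copy of element with @which replaced by <IRI> and <name>.

    Elements whose @which value is not in the table are returned unchanged.
    """
    key = (element.tag, element.get("which"))
    if key not in VOCABULARY:
        return element
    iri, name = VOCABULARY[key]
    attributes = {k: v for k, v in element.attrib.items() if k != "which"}
    expanded = ET.Element(element.tag, attributes)
    ET.SubElement(expanded, "IRI").text = iri
    ET.SubElement(expanded, "name").text = name
    return expanded


short_form = ET.fromstring('<bitext-relation xml:id="B-descends-from-A" which="a/x+/b"/>')
full_form = expand_which(short_form)
print(ET.tostring(full_form, encoding="unicode"))
```

Either form gives a processor the same identifier to work with; the short form only spares us the typing.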
One or more <reuse-type>
s specify how one text has reused another. The IRI we
have used shows that we believe that the later text has generally adapted the earlier
one. If this were a translation or a quotation or some other kind of text reuse, we
might have used a different IRI.
A third declaration, <token-definition>
, specifies how we have defined our word
tokens. @src
has more than one
value, specifying that the same tokenization rule should be applied to both
sources.
The value for @which, letters, is a reserved TAN keyword that defines a token as any consecutive string of word characters, ignoring spaces and punctuation. Under this token definition the phrase "Hush!" said he would have three tokens. Had we set the value of @which to the reserved TAN keyword letters and punctuation, we would have six tokens, since each punctuation mark would be defined as a token.
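The token counts above can be reproduced with two regular expressions. The patterns in this Python sketch are our own approximations of the two keywords, not the definitions TAN actually uses, and the function name tokenize is hypothetical.

```python
import re

# Approximate patterns for the two reserved keywords (assumptions).
TOKEN_PATTERNS = {
    # runs of word characters only
    "letters": re.compile(r"\w+"),
    # runs of word characters, or any single non-space, non-word character
    "letters and punctuation": re.compile(r"\w+|[^\w\s]"),
}


def tokenize(text, which="letters"):
    """Split text into tokens according to the named token definition."""
    return TOKEN_PATTERNS[which].findall(text)


phrase = '"Hush!" said he'
print(tokenize(phrase))                             # ['Hush', 'said', 'he']
print(tokenize(phrase, "letters and punctuation"))  # ['"', 'Hush', '!', '"', 'said', 'he']
```

Note that \w in Python also matches digits and the underscore, so the first pattern is slightly broader than letters alone.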
<token-definition>
is optional. If we leave it out, users are to assume that we mean
letters
. This is because most often, whenever in ordinary
conversation we refer to the nth word in a sentence we assume people will skip
punctuation marks in their counting.