<head>
)
Now that we have explored various IRI vocabularies for concepts around our versions of Ring-a-ring-a-roses, we can complete the metadata in our four TAN files. Let us start with the TAN-T file of the 1881 version:
<head>
   <name>TAN transcription of Ring a Ring o' Roses</name>
   <master-location href="http://textalign.net/release/TAN-2018/examples/ring-o-roses.eng.1881.xml"/>
   <license>
      <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
      <name>Attribution 4.0 International</name>
   </license>
   <licensor who="park"/>
   <source>
      <IRI>http://lccn.loc.gov/12032709</IRI>
      <name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]</name>
   </source>
   <definitions>
      <work>
         <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
         <name>"Ring a Ring o' Roses" or "Ring Around the Rosie"</name>
      </work>
      <div-type xml:id="line">
         <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI>
         <name>line of poetry</name>
      </div-type>
      <person xml:id="park">
         <IRI>tag:parkj@textalign.net,2015:self</IRI>
         <name>Jenny Park</name>
      </person>
      <role xml:id="creator">
         <IRI>http://schema.org/creator</IRI>
         <name xml:lang="eng">creator</name>
      </role>
   </definitions>
   <resp roles="creator" who="park"/>
   <change when="2014-08-13" who="park">Started file</change>
</head>
<name>
, the human-readable
counterpart to the @id
that is
inside the root element, can be anything. And we can supply more than one <name>
, in case we wish to provide it
in different languages or variations.
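As a minimal sketch of that last point, a <head> might carry two names in different languages. Only the first name appears in our file; the German name and its language code are invented here purely for illustration.

```xml
<!-- The name actually used in our 1881 file -->
<name xml:lang="eng">TAN transcription of Ring a Ring o' Roses</name>
<!-- Hypothetical second name, added only to illustrate multiple names -->
<name xml:lang="deu">TAN-Transkription von Ring a Ring o' Roses</name>
```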
<master-location>
is mandatory only if we have claimed through @in-progress
that the file is no
longer in progress. One or more of these elements provide URLs where master versions
of the file are kept (and updated). We provide this as a courtesy to others who might
be using our data. Anyone who validates a local copy of the file will be warned if it
does not match the master version, and be told the most recent changes. This allows
users to find out if changes have been made, and it allows us to make corrections
and silently notify other users of our alterations. To communicate this, we do not
have to keep track of who is using the file.
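Because one or more of these elements may be supplied, a file kept in two places could be declared roughly as follows. The first URL is the one from our example; the second is a hypothetical mirror, assumed only for the sake of the sketch.

```xml
<!-- Master location from our 1881 file -->
<master-location href="http://textalign.net/release/TAN-2018/examples/ring-o-roses.eng.1881.xml"/>
<!-- Hypothetical mirror; this URL is an assumption, not a real location -->
<master-location href="http://example.org/tan/ring-o-roses.eng.1881.xml"/>
```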
<license>
specifies the
license under which we are releasing our data. This element has nothing to do with
the copyright of the source we have used (although, having been published in 1881,
the book is clearly in the public domain). That is, we are declaring the rights
attached to the data, not its source. This once again gets to the TAN metadata
principle of describing our data and not other things. We can if we want describe the
license of the source we have used (see the rest of the guidelines for guidance), but
we absolutely must declare whether we have placed additional strictures on the
dataset we have created. In this example, we have released the data under a Creative
Commons license. The child element <IRI>
specifies the IRI assigned by Creative Commons, and
<name>
is the
human-readable form.
<licensor>
, by means of
@who
, indicates who holds the
license. In this case it points to a person.
The conjunction of <IRI>
and
<name>
, the IRI +
name pattern, is a recurrent feature of TAN files. We may include any
number of <IRI>
or <name>
elements in an IRI + name
pattern. But if we do so, we are stating that they all name the same thing, not
different things.
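A sketch of an IRI + name pattern with more than one of each element might look like this. The second <IRI> and the second <name> are hypothetical; the point is only that every IRI and every name inside the same parent is taken to identify the same person.

```xml
<person xml:id="park">
   <!-- Identifier used in our file -->
   <IRI>tag:parkj@textalign.net,2015:self</IRI>
   <!-- Hypothetical second identifier for the same person -->
   <IRI>http://example.org/people/jenny-park</IRI>
   <name>Jenny Park</name>
   <!-- Hypothetical alternative form of the same name -->
   <name>Park, Jenny</name>
</person>
```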
<source>
points, through its
IRI + name pattern, to a computer- and human-readable description of the book we have
chosen.
<definitions>
contains
data that is specific to TAN file types, to define our terminology.
<work>
uses the IRI + name
pattern to name the work we have chosen to transcribe. <div-type>
specifies the type of
divisions we have chosen to use to segment the transcription. In a more complex text,
there would be several <div-type>
s. Each one has an @xml:id
, which takes as a value some
nickname that we wish to use for @type
values of <div>
s.
The IRI + name pattern is also used for <person>
, which describes who was involved in creating the
data, and <role>
. We may have as
many <person>
s and <role>
s as we wish. In this case, Jenny
Park has been given a tag URI. The <IRI>
value of <role>
comes from the vocabulary of schema.org, which is maintained by Bing,
Google, and Yahoo! in conjunction with the W3C (the nonprofit organization dedicated
to universal Internet standards), but we could have used Dublin Core or some other
IRI vocabulary describing behaviors, responsibilities, and roles.
Those roles and persons get combined after the <definitions>
, in a <resp>
, which stipulates who was responsible for what roles.
Note | |
---|---|
If you decide to modify someone else's TAN file, then you, not the original person or organization, become responsible for the changes. Your first point of order should be to add a |
Remember that <head>
is
focused on the data, not its sources, so the claim that Jenny Park is the creator
pertains only to the data. No inference should be made about who created the source.
If someone wants that information, or anything else about the source, they should
pursue the identifier we have provided under <source>
.
<change>
has attributes
@when
and @who
that specify who made the
change/comment and when. The value of @when
is always a date, optionally followed by a time, formatted according to
the standard YYYY-MM-DD
(plus the optional time). @who
always carries a value that refers
to an agent/@xml:id
. Neither
<change>
nor <comment>
takes <IRI>
or any other children.
So now we have finished one transcription file's metadata. The other one will look similar, but we'll also take a couple of shortcuts:
<head>
   <name>TAN transcription of Ring around the Rosie</name>
   <master-location>ring-o-roses.eng.1987.xml</master-location>
   <license>
      <IRI>http://creativecommons.org/licenses/by/4.0/deed.en_US</IRI>
      <name>Creative Commons Attribution 4.0 International License</name>
      <desc>This data file is licensed under a Creative Commons Attribution 4.0 International License. The license is granted independent of rights and licenses associated with the source. </desc>
   </license>
   <licensor who="park"/>
   <source>
      <IRI>http://lccn.loc.gov/87042504</IRI>
      <name>Mother Goose, from nursery to literature / by Gloria T. Delama, 1987.</name>
   </source>
   <definitions>
      <work>
         <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
         <name>Ring around the Rosie</name>
      </work>
      <div-type xml:id="l" which="line (verse)"/>
      <person xml:id="park" roles="creator">
         <IRI>tag:parkj@textalign.net,2015:self</IRI>
         <name xml:lang="eng">Jenny Park</name>
      </person>
      <role xml:id="creator" which="creator"/>
   </definitions>
   <alter>
      <normalization which="no hyphens"/>
   </alter>
   <resp roles="creator" who="park"/>
   <change when="2014-10-24" who="park">Started file</change>
   <comment when="2014-10-24" who="park">See p. 39 of source.</comment>
</head>
One significant difference is that three of the elements that normally take the
IRI + name pattern have been replaced with a simpler form that
takes merely @which
and
@xml:id
. For a number of
elements, TAN has predefined vocabulary that can be invoked by calling it (through
@which
) and giving it an
abbreviation to be used elsewhere in the document (@xml:id
).
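The two forms can be compared side by side using <role>, which appears both ways in our two transcription files. Which IRI stands behind the predefined entry is decided by the TAN vocabulary itself; we assume here only that it names the same concept.

```xml
<!-- Full IRI + name pattern, as in the 1881 file -->
<role xml:id="creator">
   <IRI>http://schema.org/creator</IRI>
   <name xml:lang="eng">creator</name>
</role>

<!-- Shorthand, as in the 1987 file: @which invokes predefined TAN vocabulary -->
<role xml:id="creator" which="creator"/>
```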
After <definitions>
comes a new element, <alter>
,
which contains a <normalization>
statement that declares, through the name and
the IRI in the underlying TAN definition, that we have opted to remove word-break
line-end hyphenation. This provides a cautionary note to users of our data who might
value line-end hyphenation. Any number of <normalization>
s can be used to describe any alterations we
might have made in our transcription. In other transcriptions we could use this
feature to declare other suppressions, such as editorial comments or footnote
signals.
Note that the value of div-type/@xml:id
here, the letter l
, differs from our
previous transcription file, line
. Even though we have adopted a
different nickname, they are treated as equivalent because in each file we have
defined l
or line
with the same IRI,
http://dbpedia.org/resource/Line_(poetry)
. A computer that later
looks for files with lines of poetry will not care about l
and
line
, but will look at the underlying IRI that defines these terms.
This exemplifies how linked data (see above) can support our work. We are free to use
abbreviations and terms that make sense to us, yet we tie those abbreviations to IRIs
that have valence outside our project.
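Placed next to each other, the two declarations from our transcription files look like this. The nicknames differ, but each resolves to an IRI, and it is that IRI, not the nickname, that a processor is expected to compare.

```xml
<!-- 1881 file: nickname "line", IRI given explicitly -->
<div-type xml:id="line">
   <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI>
   <name>line of poetry</name>
</div-type>

<!-- 1987 file: nickname "l", IRI supplied by the predefined TAN vocabulary -->
<div-type xml:id="l" which="line (verse)"/>
```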
Now that we have created the metadata for our transcriptions, we turn to the
alignment files. Those <head>
s
will look slightly different. We start with the TAN-A-div
file:
<head>
   <name>div-based alignment of multiple versions of Ring o Roses</name>
   <master-location>ringoroses.div.1.xml</master-location>
   <license which="by_4.0"/>
   <licensor who="park"/>
   <source xml:id="eng-uk">
      <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
      <name>Transcription of ring around the roses in English (UK)</name>
      <location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml</location>
   </source>
   <source xml:id="eng-us">
      <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
      <name>Transcription of ring around the roses in English (US)</name>
      <location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml</location>
   </source>
   <definitions>
      <person xml:id="park">
         <IRI>tag:parkj@textalign.net,2015:self</IRI>
         <name xml:lang="eng">Jenny Park</name>
      </person>
      <role xml:id="creator">
         <IRI>http://schema.org/creator</IRI>
         <name xml:lang="eng">creator</name>
      </role>
   </definitions>
   <resp who="park" roles="creator"/>
   <change when="2014-08-14" who="park">Started file</change>
</head>
Much of the code above will look similar to the previous two examples. Every
alignment file has only one kind of source, namely TAN transcription files, nothing
else. Therefore <source>
's
<IRI>
always takes the
@id
value of the corresponding
TAN transcription file. <name>
is
arbitrary. It may replicate exactly the title found in the transcription file, or it
may be modified, perhaps to harmonize better with the descriptions of the other texts
aligned in the file. <source>
also has a child element not seen in the earlier two examples, <location>
, which specifies where
the digital file was accessed and when (through @when-accessed
). We may include
as many of these <location>
elements as we wish, with the most preferred or reliable location at the top, since
the validation process will use the first document that is available. The @when-accessed
value is
important, because the validator will look for changes in the file, and if there have
been changes since we last accessed the file, it will return a warning with a summary
of the number and kind of changes. If such a report is returned, it is up to us to
determine if the alterations merit any action on our part.
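A <source> with a fallback copy might be sketched as follows. The first <location> is the one from our file; the second, including its URL and access date, is a hypothetical addition to show the ordering from most to least preferred.

```xml
<source xml:id="eng-uk">
   <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
   <name>Transcription of ring around the roses in English (UK)</name>
   <!-- Preferred copy, listed first -->
   <location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml</location>
   <!-- Hypothetical fallback copy; URL and date are assumptions -->
   <location when-accessed="2015-03-10">http://example.org/tan/ring-o-roses.eng.1881.xml</location>
</source>
```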
Our TAN-A-div file could have any number of <source>
s, and not necessarily for the same work. It also
does not matter in which order we put the <source>
s. <definitions>
declares only a person and a role, mainly because we have, in this
case, no working assumptions to declare. In more advanced uses, this element would
contain more.
This <head>
explains why the
<body>
of our TAN-A-div
file is allowed to be empty. We have already specified which sources are to be
aligned and where they are to be found. All TAN-A-div files assume, by default, that
every source that is a version of the same work should be aligned upon the basis of
the @n
value of <div>
s. That is, any user or processor
of a TAN-A-div file may assume that all implicit alignments should be made unless
otherwise specified.
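To make the default concrete, suppose both transcriptions label their lines with matching @n values. The fragments below are only a sketch: the layout of each transcription's body is assumed, and the line text is replaced by placeholder comments.

```xml
<!-- eng-uk (1881), hypothetical body fragment -->
<div type="line" n="1"><!-- first line of the 1881 version --></div>
<div type="line" n="2"><!-- second line of the 1881 version --></div>

<!-- eng-us (1987), hypothetical body fragment -->
<div type="l" n="1"><!-- first line of the 1987 version --></div>
<div type="l" n="2"><!-- second line of the 1987 version --></div>
```

With nothing further stated in the alignment file, a processor would pair the two divisions whose @n is 1, then the two whose @n is 2.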
For transcriptions that are already similarly structured and labeled, a TAN-A-div
file is unnecessary for alignment. But we will see that the options available in a
TAN-A-div's <definitions>
and <body>
will allow us not only
to deal with inconsistencies in source transcriptions but to make important claims,
e.g., where one work quotes from another.
Meanwhile we turn to our fourth file, TAN-A-tok, whose <head>
looks like
this:
<head>
   <name>token-based alignment of two versions of Ring o Roses</name>
   <master-location>ringoroses.01+02.token.1.xml</master-location>
   <license which="by-nc-nd_4.0" rights-holder="park"/>
   <source xml:id="ring1881">
      <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
      <name>Ring o roses 1881</name>
      <location when-accessed="2015-01-17">../TAN-T/ring-o-roses.eng.1881.xml</location>
   </source>
   <source xml:id="ring1987">
      <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
      <name>Ring o roses 1987</name>
      <location when-accessed="2015-01-17">../TAN-T/ring-o-roses.eng.1987.xml</location>
   </source>
   <definitions>
      <bitext-relation xml:id="B-descends-from-A">
         <IRI>tag:textalign.net,2015:bitext-relation:a/x+/b</IRI>
         <name>B descends directly from A, unknown number of intermediaries</name>
         <desc>The 1987 version is hypothesized to descend somehow from the 1881 version, mainly for the sake of illustration.</desc>
      </bitext-relation>
      <reuse-type xml:id="adaptationGeneral">
         <IRI>tag:textalign.net,2015:reuse-type:adaptation:general</IRI>
         <name>general adaptation</name>
      </reuse-type>
      <token-definition src="ring1881 ring1987" which="letters"/>
      <person xml:id="park" roles="creator">
         <IRI>tag:parkj@textalign.net,2015:self</IRI>
         <name xml:lang="eng">Jenny Park</name>
      </person>
      <role xml:id="creator" which="creator"/>
   </definitions>
   <change when="2015-01-20" who="park">Started file</change>
</head>
The TAN-A-tok <head>
looks
similar to the previous examples, except that <definitions>
has some new
content.
<bitext-relation>
states through an IRI + name pattern the stemmatic relationship we think holds
between the two sources. (Stemmatics is the study of the chain of transmission—the
relationship of an original text-bearing object to the ones that survive. It
frequently involves the creation of genealogical-like trees to illustrate the work's
version history.) We have used the entire IRI + name pattern, but we could have
substituted it with @which
and
the value a/x+/b
.
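Under that substitution the declaration would shrink to a single empty element, roughly as sketched here. We keep the same @xml:id so that references elsewhere in the file would not need to change.

```xml
<!-- Shorthand sketch: @which takes the value a/x+/b in place of the IRI + name pattern -->
<bitext-relation xml:id="B-descends-from-A" which="a/x+/b"/>
```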
One or more <reuse-type>
s specify how one text has reused another. The IRI we
have used shows that we believe that the later text has generally adapted the earlier
one. If this were a translation or a quotation or some other kind of text reuse, we
might have used a different IRI.
A third declaration, <token-definition>
, specifies how we have defined our word
tokens. @src
has more than one
value, specifying that the same tokenization rule should be applied to both sources.
This element is optional. If we leave it out, users are to assume that we mean
letters
. This is because, in ordinary
conversation, when we refer to the nth word in a sentence we assume people will skip
punctuation marks when they count.
The value for @which
,
letters
, is a reserved TAN keyword that specifies that a token is any
consecutive string of word characters, ignoring spaces and punctuation. Under this
token definition the phrase "Hush!" said he
would have three tokens. Had
we set the value of @which
to the
reserved TAN keyword letters and punctuation
, we would have six tokens,
since each punctuation mark would be defined as a token.
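The alternative declaration, and the way each keyword would divide the sample phrase, can be sketched as follows. The token boundaries in the comment simply restate the counts given above; they are not output from a TAN processor.

```xml
<!-- Alternative to which="letters": every punctuation mark becomes its own token -->
<token-definition src="ring1881 ring1987" which="letters and punctuation"/>

<!-- Sample phrase: "Hush!" said he
     letters:                  Hush | said | he                  (3 tokens)
     letters and punctuation:  " | Hush | ! | " | said | he      (6 tokens) -->
```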