We now have a small corpus of TAN files. Let us imagine what it might be like to connect our TAN corpus to another. Let us assume that we have found elsewhere, in a German project, a TAN transcription of a work that looks quite similar to our own:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:hans@beispiel.com,2014:ringel">
    <head>
        <name>TAN Transkription, Ringelreihen mit Riederfallen</name>
        <master-location>http://beispiel.com/TAN-T/ringel.xml</master-location>
        <license>
            <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
            <name>Creative Commons Namensnennung 4.0 International Lizenz</name>
            <desc>Dieses Werk ist lizenziert unter einer Creative Commons Namensnennung 4.0 International Lizenz.</desc>
        </license>
        <licensor who="schmidt"/>
        <source>
            <IRI>http://www.worldcat.org/oclc/4574384</IRI>
            <name>Franz Magnus Böhme, Deutsches Kinderlied und Kinderspiel: Volksüberlieferungen aus allen Landen deutscher Zunge, gesammelt, geordnet und mit Angabe der Quellen. Leipzig, 1897.</name>
        </source>
        <definitions>
            <work>
                <IRI>tag:beispiel.com,2014:texte:holderbusch</IRI>
                <name>"Die Kinder auf dem Holderbusch"</name>
            </work>
            <version>
                <IRI>urn:uuid:31648039-3dbb-49b9-b66e-9bd2cd11630e</IRI>
                <name>zweite Version</name>
            </version>
            <div-type xml:id="Zeile">
                <IRI>http://dbpedia.org/resource/Gedichtzeile</IRI>
                <name>Gedichtzeile</name>
            </div-type>
            <person xml:id="schmidt" roles="Produzent">
                <IRI>tag:hans@beispiel.com,2014:selbst</IRI>
                <name xml:lang="eng">Hans Schmidt</name>
            </person>
            <role xml:id="Produzent">
                <IRI>http://schema.org/producer</IRI>
                <name xml:lang="eng">Produzent</name>
            </role>
            <ambiguous-letter-numerals-are-roman>false</ambiguous-letter-numerals-are-roman>
        </definitions>
        <alter>
            <normalization>
                <IRI>tag:kalvesmaki@gmail.com,2014:normalization:hyphens-discretionary-off</IRI>
                <name>Keine Bindestriche</name>
            </normalization>
        </alter>
        <resp who="schmidt" roles="Produzent"/>
        <change when="2014-08-13" who="schmidt">Anfang</change>
        <comment when="2014-08-13" who="schmidt">unten auf der Z. 438, recht</comment>
    </head>
    <body xml:lang="deu" in-progress="false">
        <div type="Zeile" n="a">Ringel, Ringel, Reihe!</div>
        <div type="Zeile" n="b">Sind der Kinder dreie,</div>
        <div type="Zeile" n="c">Sitzen auf dem Holderbuch,</div>
        <div type="Zeile" n="e">Schreien alle: husch, husch, husch!</div>
    </body>
</TAN-T>
It seems clear to us that this 19th-century German version is quite similar to our two English versions, and several alignment options are open to us. Two more sets of word-for-word alignments would be interesting. But finding a text that aligns nicely with others does not oblige us to align them, nor, if we do choose to align them, to align everything. In this case we set word-for-word alignment aside and focus only on the TAN-A-div alignment, so that, for example, we can later read the three versions in parallel and study their relationships.
To that end, we first observe some differences between this transcription and our other two. First, the value of <work> is not the one we have given our two versions. Second, the <div-type> is defined as http://dbpedia.org/resource/Gedichtzeile (Gedichtzeile = line of poetry). Third, the lines have been lettered instead of numbered (and they are stipulated to be letter numerals, not Roman, through <ambiguous-letter-numerals-are-roman>). And last, the editor seems to have made a typographical error, making the last line n="e" instead of n="d". These four differences typify some of the inconsistencies that are commonly found in digital texts.
Note: There are a few other differences in this third transcription that do not affect our alignment.
These are points we can easily reconcile in our TAN-A-div file, which we now expand to include the German version. We make the following adjustments:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A-div.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A-div.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-A-div xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring-alignment">
    <head>
        <name>div-based alignment of multiple versions of Ring o Roses</name>
        <master-location>ringoroses.div.1.xml</master-location>
        <license which="by_4.0"/>
        <licensor who="park"/>
        <source xml:id="eng-uk">
            <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
            <name>Transcription of ring around the roses in English (UK)</name>
            <location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml</location>
        </source>
        <source xml:id="eng-us">
            <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
            <name>Transcription of ring around the roses in English (US)</name>
            <location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml</location>
        </source>
        <source xml:id="ger">
            <!-- matches the @id of the German TAN-T file -->
            <IRI>tag:hans@beispiel.com,2014:ringel</IRI>
            <name>Transcription of an ancestor of Ring around the roses in German</name>
            <location when-accessed="2014-08-22">http://beispiel.com/TAN-T/ringel.xml</location>
            <location when-accessed="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml</location>
        </source>
        <definitions>
            <person xml:id="park">
                <IRI>tag:parkj@textalign.net,2015:self</IRI>
                <name xml:lang="eng">Jenny Park</name>
            </person>
            <role xml:id="creator">
                <IRI>http://schema.org/creator</IRI>
                <name xml:lang="eng">creator</name>
            </role>
            <alias id="ring" idrefs="ger eng-us"/>
        </definitions>
        <alter src="ger">
            <rename n="5" by="-1"/>
        </alter>
        <resp who="park" roles="creator"/>
        <change when="2014-08-14" who="park">Started file</change>
        <change when="2014-08-22" who="park">Added German version.</change>
    </head>
    <body/>
</TAN-A-div>
The first major change is the insertion of a third <source>, pointing to the new file and specifying its name and IRI. Note that two locations have been provided, one for the original location and another for the copy saved locally into our project folder. Validation is made against the first document available. If we wanted to work primarily off our local copy, we would have put that <location> first. By placing it second, we allow the validation engine to look for updates and changes in the master version. If that version is unavailable, validation will be made against the second, local copy.
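The "first available document wins" rule can be sketched in Python. This is only an illustration of the ordering logic, not the code a TAN validation engine actually runs; the function name `first_available` and the `can_fetch` callback are hypothetical.

```python
from typing import Callable, Iterable, Optional


def first_available(locations: Iterable[str],
                    can_fetch: Callable[[str], bool]) -> Optional[str]:
    """Return the first location, in document order, that can be fetched."""
    for location in locations:
        if can_fetch(location):
            return location
    return None  # no copy could be reached


# The two <location> values of the German <source>, in document order.
GER_LOCATIONS = [
    "http://beispiel.com/TAN-T/ringel.xml",
    "../TAN-T/ring-o-roses.deu.1897.xml",
]


def master_offline(location: str) -> bool:
    """Hypothetical availability check: pretend the remote server is down,
    so only relative (local) paths can be read."""
    return not location.startswith("http")
```

With every location reachable, the master copy is chosen; with `master_offline`, the local copy is chosen instead. Reversing the order of `GER_LOCATIONS` would make the local copy the preferred one.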
The second major change, to address the German version's different value of <work>, is the addition of an <alias>. If and when we make claims about a work in general, via @work, the id value ring will mean that we are asserting the claim to be true for any scriptum that shares the IRI values of the <work> in either the German or the US version. We do not need to mention eng-uk specifically in the <alias>, since it already has a work IRI in common with the US version.
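Why eng-uk need not be named in the alias can be shown with a small Python sketch that groups sources into works. It is a simplified model, not TAN's own processing: the function `work_groups` is hypothetical, and the English work IRI below is a placeholder, since the English transcriptions are not reproduced in this section.

```python
from typing import Dict, List, Set

# Work IRIs declared by each source. The German IRI comes from the
# transcription above; the English IRI is a HYPOTHETICAL placeholder.
WORK_IRIS: Dict[str, Set[str]] = {
    "eng-uk": {"tag:example.org,2015:work:ring-placeholder"},
    "eng-us": {"tag:example.org,2015:work:ring-placeholder"},
    "ger": {"tag:beispiel.com,2014:texte:holderbusch"},
}


def work_groups(work_iris: Dict[str, Set[str]],
                aliases: List[List[str]]) -> List[List[str]]:
    """Merge sources that share a work IRI or are named together in an alias."""
    parent = {src: src for src in work_iris}

    def find(src: str) -> str:
        # Follow parent links to the representative of the group.
        while parent[src] != src:
            src = parent[src]
        return src

    def union(a: str, b: str) -> None:
        parent[find(a)] = find(b)

    # 1. Sources that have a work IRI in common belong to the same work.
    first_seen: Dict[str, str] = {}
    for src, iris in work_iris.items():
        for iri in iris:
            if iri in first_seen:
                union(src, first_seen[iri])
            else:
                first_seen[iri] = src

    # 2. Each alias (a list of idrefs) merges the sources it names.
    for idrefs in aliases:
        for other in idrefs[1:]:
            union(idrefs[0], other)

    groups: Dict[str, List[str]] = {}
    for src in work_iris:
        groups.setdefault(find(src), []).append(src)
    return sorted(sorted(members) for members in groups.values())
```

Calling `work_groups(WORK_IRIS, [["ger", "eng-us"]])` places all three sources in one group, because the alias links ger to eng-us and the shared IRI already links eng-us to eng-uk.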
A <rename> takes care of the apparent typographical error, this time anchoring the German version to the US one. Note that the German version uses e, but we have used 5. We could have used e, or even the Roman numeral v, had we wished to. Every TAN file's numeration system is evaluated locally, independent of any companion files. So we need not reconcile the a, b, and c in the @n values in the German version, because these will be automatically treated as equivalent to 1, 2, and 3. The TAN format allows four numeration systems other than Arabic numerals: Roman numerals (uppercase or lowercase), alphabetic numerals (a, b, c, ..., z, aa, bb, ...), digit-alphabet combinations (e.g., 1a, 1e, 4g), and alphabet-digit combinations (e.g., a4, a5, b5). The last two systems will be treated as numerical pairs (1 and 1, 1 and 5, etc.).
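The numeral handling and the effect of the <rename> can be approximated in Python. This is a rough sketch under stated assumptions, not TAN's implementation: all function names are hypothetical, the alphabetic scheme is assumed to continue z, aa (27), bb (28) as the list above suggests, and the letter part of a combination is assumed to be alphabetic rather than Roman.

```python
import re
from typing import List, Tuple

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(text: str) -> int:
    """Convert a Roman numeral of either case, using the subtractive rule."""
    values = [ROMAN_VALUES[ch] for ch in text.lower()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value placed before a larger one is subtracted (iv = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def alpha_to_int(text: str) -> int:
    """a=1 ... z=26, aa=27, bb=28 ... (assumed repeated-letter scheme)."""
    text = text.lower()
    return (len(text) - 1) * 26 + (ord(text[0]) - ord("a") + 1)


def n_to_numbers(n: str, ambiguous_are_roman: bool) -> Tuple[int, ...]:
    """Normalize one @n value to a tuple of integers."""
    n = n.strip().lower()
    if re.fullmatch(r"[0-9]+", n):
        return (int(n),)
    pair = re.fullmatch(r"([0-9]+)([a-z]+)", n)
    if pair:  # digit-alphabet combination, e.g. 1e -> (1, 5)
        return (int(pair.group(1)), alpha_to_int(pair.group(2)))
    pair = re.fullmatch(r"([a-z]+)([0-9]+)", n)
    if pair:  # alphabet-digit combination, e.g. a4 -> (1, 4)
        return (alpha_to_int(pair.group(1)), int(pair.group(2)))
    is_roman = re.fullmatch(r"[ivxlcdm]+", n) is not None
    is_alpha = re.fullmatch(r"([a-z])\1*", n) is not None
    # Values such as "c" or "v" fit both systems; the flag decides.
    if is_roman and (ambiguous_are_roman or not is_alpha):
        return (roman_to_int(n),)
    if is_alpha:
        return (alpha_to_int(n),)
    raise ValueError(f"unrecognized numeral: {n!r}")


def apply_rename(ns: List[str], n: str, by: int,
                 ambiguous_are_roman: bool) -> List[int]:
    """Shift every single-number @n equal to `n` by `by`; leave the rest alone."""
    target = n_to_numbers(n, False)  # the rename's own n is read locally
    result = []
    for value in ns:
        numbers = n_to_numbers(value, ambiguous_are_roman)
        result.append(numbers[0] + by if numbers == target else numbers[0])
    return result
```

Under these assumptions, `apply_rename(["a", "b", "c", "e"], "5", -1, False)` yields `[1, 2, 3, 4]`: the German letters are read as 1, 2, 3, and 5, and the rename moves 5 down to 4, matching the numbering of the English versions.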
The last major insertion is a new <change>, documenting when we made the alterations. The value of @when effectively updates the version of our TAN-A-div file.
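One way to picture this is to read the most recent @when out of the head. The Python sketch below is a simplification: it looks only at <change> elements, assumes plain dates (not date-times), and uses a hypothetical function name, `latest_change`.

```python
import xml.etree.ElementTree as ET
from datetime import date

TAN_NS = "tag:textalign.net,2015:ns"

# Excerpt of the <head> above, reduced to its two <change> elements.
HEAD_EXCERPT = """
<head xmlns="tag:textalign.net,2015:ns">
  <change when="2014-08-14" who="park">Started file</change>
  <change when="2014-08-22" who="park">Added German version.</change>
</head>
"""


def latest_change(head_xml: str) -> date:
    """Return the most recent @when among the <change> elements."""
    head = ET.fromstring(head_xml)
    whens = [
        date.fromisoformat(change.attrib["when"])
        for change in head.iter(f"{{{TAN_NS}}}change")
    ]
    return max(whens)
```

For the excerpt, `latest_change(HEAD_EXCERPT)` returns 22 August 2014, the date on which the German version was added.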
With these changes, the new version is aligned with the other two. Our work might have been simpler had we just modified the German version ourselves. But such changes would have affected only our local copy, not the master one, and changing only our local copy would not allow us to connect our work to other TAN files that may depend upon the same master file.
But perhaps Hans Schmidt, the producer of the German version, can be contacted. We do so, and we suggest that he modify the version to make it align better. In the case of <div-type>, he need merely add another element, <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI> (or, even better, use the built-in TAN vocabulary). Perhaps he has reasons for labeling the lines with letters, and perhaps he is reluctant to explicitly identify this poem with Ring around the Rosie. That is within his rights. But the conversation might lead to our pointing out that n="e" should probably be n="d" and that there is an apparent discrepancy in the third line. (The original printed book has the poem twice on page 438, once with the spelling "Holderbuch" and once with "Holderbusch.") If Schmidt chooses to correct his master file, he can add a new <change>, and thereby tacitly notify anyone else using the file that corrections have been made.
At this point we have a network of five TAN files, four in our corpus and one from outside. Although simple, the network could be the basis for some creative and complex research questions. Stylesheets could be used to automatically align the versions for reading and study, or to perform statistical analysis. Study of the rest of these guidelines, as well as example TAN libraries, will suggest numerous ways to create, manage, share, and use TAN files.