We now have a small, tightly knit corpus of TAN files. Let us imagine what it might be like to connect our TAN corpus to another. Let us assume that we have found in a German project a TAN transcription of a work that looks quite similar to our own:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:hans@beispiel.com,2014:ringel">
  <head>
    <name>TAN Transkription, Ringelreihen mit Riederfallen</name>
    <master-location>http://beispiel.com/TAN-T/ringel.xml</master-location>
    <rights-excluding-sources rights-holder="schmidt">
      <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
      <name>Creative Commons Namensnennung 4.0 International Lizenz.</name>
      <desc>Dieses Werk ist lizenziert unter einer Creative Commons Namensnennung 4.0 International Lizenz.</desc>
    </rights-excluding-sources>
    <source>
      <IRI>http://www.worldcat.org/oclc/4574384</IRI>
      <name>Franz Magnus Böhme, Deutsches Kinderlied und Kinderspiel: Volksüberlieferungen aus allen Landen deutscher Zunge, gesammelt, geordnet und mit Angabe der Quellen. Leipzig, 1897.</name>
    </source>
    <declarations>
      <work>
        <IRI>tag:beispiel.com,2014:texte:holderbusch</IRI>
        <name>"Die Kinder auf dem Holderbusch"</name>
      </work>
      <version>
        <IRI>urn:uuid:31648039-3dbb-49b9-b66e-9bd2cd11630e</IRI>
        <name>zweite Version</name>
      </version>
      <div-type xml:id="Zeile">
        <IRI>http://dbpedia.org/resource/Gedichtzeile</IRI>
        <name>Gedichtzeile</name>
      </div-type>
      <filter>
        <normalization>
          <IRI>tag:kalvesmaki@gmail.com,2014:normalization:hyphens-discretionary-off</IRI>
          <name>Keine Bindestriche</name>
        </normalization>
      </filter>
    </declarations>
    <agent xml:id="schmidt" roles="Produzent">
      <IRI>tag:hans@beispiel.com,2014:selbst</IRI>
      <name xml:lang="eng">Hans Schmidt</name>
    </agent>
    <role xml:id="Produzent">
      <IRI>http://schema.org/producer</IRI>
      <name xml:lang="eng">Produzent</name>
    </role>
    <change when="2014-08-13" who="schmidt">Anfang</change>
    <comment when="2014-08-13" who="schmidt">unten auf der Z. 438, recht</comment>
  </head>
  <body xml:lang="deu" in-progress="false">
    <div type="Zeile" n="a">Ringel, Ringel, Reihe!</div>
    <div type="Zeile" n="b">Sind der Kinder dreie,</div>
    <div type="Zeile" n="c">Sitzen auf dem Holderbuch,</div>
    <div type="Zeile" n="e">Schreien alle: husch, husch, husch!</div>
  </body>
</TAN-T>
It seems clear to us that this 19th-century German version is quite similar to our two English versions. We have some alignment options open to us. Two more sets of word-for-word alignments would be interesting. But remember: finding a text that aligns nicely with others does not oblige us to align them, and if we do choose to align them, we need not align everything. In this case, we set aside word-for-word alignments and focus only on the TAN-A-div alignment, so that, for example, we can later generate an HTML report that lets us read the three versions in parallel more easily and study their relationships.
To that end, we first observe some differences between this transcription and our other two. First, the value of <work> is not the one we have given our two versions. Second, the <div-type> is defined as http://dbpedia.org/resource/Gedichtzeile (Gedichtzeile = line of poetry). Third, the lines have been lettered instead of numbered. And last, the editor seems to have made a typographical error, making the last line n="e" instead of n="d". These four differences typify some of the inconsistencies that are commonly found in digital texts.
Note: There are a few other differences in this third transcription that do not affect our alignment.
These are points we can easily reconcile in our TAN-A-div file, which we now expand to include the German version. We make the following adjustments (in boldface):
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-div.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-div.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-A-div xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring-alignment">
  <head>
    <name>div-based alignment of multiple versions of Ring o Roses</name>
    <master-location>ringoroses.div.1.xml</master-location>
    <rights-excluding-sources which="by-nc-nd_4.0" rights-holder="park"/>
    <source xml:id="eng-uk">
      <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
      <name>Transcription of ring around the roses in English (UK)</name>
      <location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml</location>
    </source>
    <source xml:id="eng-us">
      <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
      <name>Transcription of ring around the roses in English (US)</name>
      <location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml</location>
    </source>
    <source xml:id="ger">
      <IRI>tag:beispiel.com,2014:ringel</IRI>
      <name>Transcription of an ancestor of Ring around the roses in German</name>
      <location when-accessed="2014-08-22">http://beispiel.com/TAN-T/ringel.xml</location>
      <location when-accessed="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml</location>
    </source>
    <declarations/>
    <agent xml:id="park" roles="creator">
      <IRI>tag:parkj@textalign.net,2015:self</IRI>
      <name xml:lang="eng">Jenny Park</name>
    </agent>
    <role xml:id="creator" which="creator"/>
    <change when="2014-08-14" who="park">Started file</change>
    <change when="2014-08-22" who="park">Added German version.</change>
  </head>
  <body>
    <equate-works src="eng-uk ger"/>
    <equate-div-types>
      <div-type-ref src="ger" div-type-ref="Zeile"/>
      <div-type-ref src="eng-uk" div-type-ref="line"/>
    </equate-div-types>
    <realign>
      <anchor-div-ref source="ger" ref="5"/>
      <div-ref source="eng-us" ref="4"/>
    </realign>
  </body>
</TAN-A-div>
The first major change is the insertion of a new <source>, identifying the name and location of the third example. Note that two locations have been provided, one for the original location and another for the copy saved locally into our project folder. Validation will occur at the first document available. If we wanted to work primarily off our local copy, we would have put it first. By placing it second, we allow the validation engine to look for updates and changes in the master version. If that version is unavailable, validation will be made against the second, local copy.
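For comparison, a source that prefers the local copy would simply list the two locations in the opposite order. The fragment below is a sketch that reuses the paths and dates from the example above; nothing else in the file would need to change.

```xml
<source xml:id="ger">
  <IRI>tag:beispiel.com,2014:ringel</IRI>
  <name>Transcription of an ancestor of Ring around the roses in German</name>
  <!-- local copy listed first, so validation tries it before the master version -->
  <location when-accessed="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml</location>
  <location when-accessed="2014-08-22">http://beispiel.com/TAN-T/ringel.xml</location>
</source>
```

The cost of this ordering is that updates to the master version are noticed only when the local copy is unavailable.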
The second major insertion is a new <change>, documenting when we made the alterations. The value of @when effectively updates the version of our TAN-A-div file.
The third major change populates the <body> with elements that calibrate the new version to the other two. <equate-works> says that, for the sake of this alignment, the works defined in the UK version and the German version are to be considered equivalent. We did not mention the US version because we do not need to. TAN rules specify that all alignments are transitive unless otherwise specified. If A and B are already defined to be the same work, and we equate A and C as the same work, then B and C will be equated as well. Note that we are not committing ourselves to the proposition that they are in reality the same work. We are making this statement only provisionally, to facilitate the alignment.
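Because equations are transitive, naming the US version in place of the UK one should lead to the same result. The fragment below is a sketch under that assumption, reusing the source ids from the example; it has not been checked against the TAN schemas.

```xml
<!-- eng-uk and eng-us already declare the same work, so either one can stand for both -->
<equate-works src="eng-us ger"/>
```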
<equate-div-types> declares that what the German version calls Zeile is, for the sake of this alignment, equivalent to what the UK version calls line. Transitivity means that Zeile is inferred to be equivalent to what the US version calls l. This element is completely optional. If we left it out, the alignment, which is based upon references, not division types, would not be affected. But by creating it, we assist users who may care about textual divisions.
A <realign> takes care of the apparent typographical error, this time anchoring the German version to the US one. Any <div-ref> in a <realign> is wrested from automatic alignment and attached to an <anchor-div-ref> and, by the law of transitivity, anything that aligns to it, in this case the UK version.
Note that we have used 5 and not e to point to the stray reference in the German version. We could have used e, or even the Roman numeral v, had we wished to, but we should find a single numbering system we are comfortable with for our TAN-A-div file and stick with it. Every TAN file's numeration system is evaluated locally, independent of any companion files. That way a single TAN file can use a single kind of numbering to access multiple TAN documents that may each use different numerals. Therefore we do not need to reconcile the letter labels a, b, and c in the @n values in the German version, because these will be automatically treated as equivalent to 1, 2, and 3. The TAN format allows four numeration systems other than Arabic numerals: Roman numerals (uppercase or lowercase), alphabetic numerals (a, b, c, ..., z, aa, bb, ...), digit-alphabet combinations (e.g., 1a, 1e, 4g), and alphabet-digit combinations (e.g., a4, a5, b5). The last two systems will be converted to hyphen-joined Arabic numerals before comparison (e.g., 1-1, 1-5, 4-7, 1-4, 1-5, 2-5).
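To make the equivalence concrete, the same realignment could be written with alphabetic numerals throughout. This is only a sketch, assuming the equivalences described above (e = 5, d = 4); the element and attribute names are those of the example file.

```xml
<realign>
  <!-- "e" is read as 5 and "d" as 4, whatever numerals the source files themselves use -->
  <anchor-div-ref source="ger" ref="e"/>
  <div-ref source="eng-us" ref="d"/>
</realign>
```

Whichever system is chosen, using it consistently within the TAN-A-div file keeps the references easy to read.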
With these changes, the new version is completely synchronized with the other two. Our work might have been simpler if we had just modified the German version ourselves. But such changes would have affected only our local copy, not the master one. Changing only our local copy would not allow us to connect our work to other TAN files that may depend upon the same master file.
But the format has also been designed to anticipate a living, growing network. Perhaps Hans Schmidt, the producer of the German version, can be contacted. We do so, and we suggest that he modify the version to make it align better. In the case of <div-type>, he need merely add another element: <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI>. This line, in addition to the preexisting <IRI>, specifies that the two IRIs are equivalent. Perhaps he has reasons for labeling the lines with letters, and perhaps he is reluctant to explicitly identify this poem with Ring around the Rosie. That is within his rights. (Remember, TAN is meant to provide a framework within which opinions can be registered, even counterintuitive ones.) But the conversation might lead to our pointing out that n="e" should probably be n="d" and that there is an apparent discrepancy in the spelling of the third line. (The original printed book has the poem twice on page 438, once with the spelling "Holderbuch" and once with "Holderbusch.") If Schmidt chooses to correct his master file, he can add a new <change>, and thereby tacitly notify anyone else using the file that corrections have been made.
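A revised master file might then contain fragments like the following. This is a sketch only: the second <change> entry, its date, and its wording are hypothetical, and the omitted parts of the file are assumed to stay as they were.

```xml
<div-type xml:id="Zeile">
  <IRI>http://dbpedia.org/resource/Gedichtzeile</IRI>
  <!-- added IRI, declared equivalent to the one above -->
  <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI>
  <name>Gedichtzeile</name>
</div-type>

<change when="2014-08-13" who="schmidt">Anfang</change>
<!-- hypothetical entry: date and wording are illustrative -->
<change when="2014-09-01" who="schmidt">Zeilennummer korrigiert</change>

<div type="Zeile" n="d">Schreien alle: husch, husch, husch!</div>
```

With the last line relabeled n="d", the <realign> in our TAN-A-div file would no longer be needed for this source.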
At this point we have a network of five TAN files, four in our corpus and one from outside. Although simple, the network could be the basis for some creative and complex research questions. Stylesheets could be used to automatically align the versions for reading and study, or to perform statistical analysis. Study of the rest of these guidelines, as well as example TAN libraries, will suggest numerous ways to create, manage, share, and use TAN files.