We now have a collection of five TAN files: two TAN-T transcriptions, a TAN-A alignment/annotation file, a TAN-A-tok word-for-word alignment file, and a TAN-voc file for vocabulary shared across the files.
Let us imagine what it might be like to connect our TAN collection to a TAN file made by someone else. Let us assume that we have found elsewhere, in a German project, a TAN transcription of a work that looks quite similar to our own:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:hans@beispiel.com,2014:ringel">
    <head>
        <name>TAN Transkription, Ringelreihen mit Riederfallen</name>
        <master-location>http://beispiel.com/TAN-T/ringel.xml</master-location>
        <license>
            <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
            <name>Creative Commons Namensnennung 4.0 International Lizenz</name>
            <desc>Dieses Werk ist lizenziert unter einer Creative Commons Namensnennung 4.0 International Lizenz.</desc>
        </license>
        <licensor who="schmidt"/>
        <work>
            <IRI>tag:beispiel.com,2014:texte:holderbusch</IRI>
            <name>"Die Kinder auf dem Holderbusch"</name>
        </work>
        <version>
            <IRI>urn:uuid:31648039-3dbb-49b9-b66e-9bd2cd11630e</IRI>
            <name>zweite Version</name>
        </version>
        <numerals priority="letters"/>
        <source>
            <IRI>http://www.worldcat.org/oclc/4574384</IRI>
            <name>Franz Magnus Böhme, Deutsches Kinderlied und Kinderspiel: Volksüberlieferungen aus allen Landen deutscher Zunge, gesammelt, geordnet und mit Angabe der Quellen. Leipzig, 1897.</name>
        </source>
        <adjustments>
            <normalization>
                <IRI>tag:kalvesmaki@gmail.com,2014:normalization:hyphens-discretionary-off</IRI>
                <name>Keine Bindestriche</name>
            </normalization>
        </adjustments>
        <vocabulary-key>
            <div-type xml:id="Zeile">
                <IRI>http://dbpedia.org/resource/Gedichtzeile</IRI>
                <name>Gedichtzeile</name>
            </div-type>
            <div-type which="poem" xml:id="Gedicht"/>
            <person xml:id="schmidt" roles="Produzent">
                <IRI>tag:hans@beispiel.com,2014:selbst</IRI>
                <name xml:lang="eng">Hans Schmidt</name>
            </person>
            <role xml:id="Produzent">
                <IRI>http://schema.org/producer</IRI>
                <name xml:lang="eng">Produzent</name>
            </role>
        </vocabulary-key>
        <file-resp who="schmidt"/>
        <resp who="schmidt" roles="Produzent"/>
        <change when="2014-08-13" who="schmidt">Anfang</change>
        <comment when="2014-08-13" who="schmidt">unten auf der Z. 438, recht</comment>
        <to-do/>
    </head>
    <body xml:lang="deu">
        <div type="Gedicht" n="1">
            <div type="Zeile" n="a">Ringel, Ringel, Reihe!</div>
            <div type="Zeile" n="b">Sind der Kinder dreie,</div>
            <div type="Zeile" n="c">Sitzen auf dem Holderbuch,</div>
            <div type="Zeile" n="e">Schreien alle: husch, husch, husch!</div>
        </div>
    </body>
</TAN-T>
It seems that this 19th-century German version is quite similar to our two English versions. We have some alignment options open to us. Two more sets of word-for-word alignments would be interesting, but remember, just because we find a text that nicely aligns with others does not mean that we must align them, or that for a given alignment we must align everything. In this case, we choose not to worry about word-for-word alignments, and we focus here only on the TAN-A alignment, so that, for example, we can use the built-in TAN application to display the three versions in parallel, to study more closely the relationships between them.
To that end, we first observe some differences between this transcription and our other two. First, the value of <work> is not the one we have given our two versions. Second, <numerals> specifies through its @priority value that any ambiguous numerals should be interpreted as letter numerals, not Roman ones. (That matters, e.g., for a <div> with an @n value of c, which could be interpreted either as the letter numeral for 3 or as the Roman numeral for 100.) Next, the lines are wrapped in a <div> for the whole poem (Gedicht), and they have been lettered instead of numbered. And last, the editor seems to have made a typographical error, labeling the last line e instead of the expected d. These five differences typify the inconsistencies one commonly finds between digital texts of the same work produced by different projects.
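The ambiguity that @priority resolves can be made concrete with a short Python sketch. It is not part of TAN or its function library: the names roman_to_int, letter_to_int, and interpret_n are hypothetical, and treating Roman numerals as the fallback priority is an assumption made only for illustration.

```python
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(text):
    """Return the value of a Roman numeral, or None if text cannot be one.

    Lenient on purpose: it checks only that every character is a Roman
    digit, not that the numeral is canonically formed.
    """
    text = text.lower()
    if not text or any(ch not in ROMAN_VALUES for ch in text):
        return None
    values = [ROMAN_VALUES[ch] for ch in text]
    total = 0
    for i, value in enumerate(values):
        # Subtractive notation: a smaller digit before a larger one is subtracted.
        if i + 1 < len(values) and values[i + 1] > value:
            total -= value
        else:
            total += value
    return total


def letter_to_int(text):
    """Return the value of a letter numeral (a, b, ..., z, aa, bb, ...), else None."""
    text = text.lower()
    if not text or not text.isascii() or not text.isalpha() or len(set(text)) != 1:
        return None
    return (len(text) - 1) * 26 + (ord(text[0]) - ord("a") + 1)


def interpret_n(n, priority="roman"):
    """Interpret an @n value, trying the preferred numeral system first.

    The "roman" default is an assumption of this sketch.
    """
    if n.isascii() and n.isdigit():
        return int(n)
    as_roman, as_letter = roman_to_int(n), letter_to_int(n)
    first, second = (as_letter, as_roman) if priority == "letters" else (as_roman, as_letter)
    return first if first is not None else second


print(interpret_n("c", priority="letters"))  # letter reading
print(interpret_n("c", priority="roman"))    # Roman reading
```

Only values such as c, d, or i are genuinely ambiguous; a value such as e can be read only as a letter numeral, so it yields the same result under either priority.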
Note: There are a few other differences in this third transcription that do not affect our alignment.
These are points we can easily reconcile in our TAN-A file, which we now expand to include the German version. We make the following adjustments (emphasized):
<TAN-A xmlns="tag:textalign.net,2015:ns" TAN-version="2020" id="tag:parkj@textalign.net,2015:ring-alignment">
    <head>
        <name>div-based alignment of multiple versions of Ring o Roses</name>
        <master-location href="http://textalign.net/release/TAN-2020/examples/TAN-A/ringoroses.div.1.xml"/>
        <license which="by_4.0" licensor="park"/>
        <source xml:id="eng-uk">
            <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
            <name>Transcription of ring around the roses in English (UK)</name>
            <location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/>
        </source>
        <source xml:id="eng-us">
            <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
            <name>Transcription of ring around the roses in English (US)</name>
            <location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/>
        </source>
        <source xml:id="ger">
            <IRI>tag:beispiel.com,2014:ringel</IRI>
            <name>Transcription of an ancestor of Ring around the roses in German</name>
            <location accessed-when="2014-08-22">http://beispiel.com/TAN-T/ringel.xml</location>
            <location accessed-when="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml</location>
        </source>
        <adjustments src="ger">
            <skip div-type="Gedicht"/>
            <rename n="e" by="-1"/>
        </adjustments>
        <vocabulary-key>
            <person xml:id="park" which="Jenny Park"/>
            <alias id="ring" idrefs="ger eng-us"/>
        </vocabulary-key>
        <resp who="park" roles="creator"/>
        <change when="2014-08-14" who="park">Started file</change>
        <change when="2014-08-22" who="park">Added German version.</change>
        <to-do>
            <comment when="2018-08-09-04:00" who="park">Finish file.</comment>
        </to-do>
    </head>
    . . . . . .
</TAN-A>
The first major change is the insertion of a third <source>, pointing to the new file and specifying its name and IRI. Note that two <location>s have been provided, one for the original and another for a local copy we have saved. Validation will take into account only the first document available. If we wanted to work primarily off our local copy, we would have put that <location> first. By placing it second, we allow the validation engine to work primarily off the master version and therefore look for updates and changes. If that version is unavailable, validation will be made against the second, local copy.
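The "first document available" rule can be pictured with a small Python sketch. This is only an illustration of the fallback order, not how TAN validation is implemented; the function name first_available and the injected fetch callable are hypothetical.

```python
def first_available(locations, fetch):
    """Return (location, document) for the first location that can be read.

    locations -- candidate locations, in the order the <location>s are listed
    fetch     -- a callable that returns the document or raises OSError
    """
    for location in locations:
        try:
            return location, fetch(location)
        except OSError:
            # This copy is unreachable, so fall through to the next location.
            continue
    return None, None


def read_local_file(path):
    """A possible fetch for local copies: read the file as UTF-8 text."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

Because the master URL is listed first, a caller would read the local copy only when the master cannot be fetched; reversing the list reverses that preference.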
<adjustments> specifies through its @src that only the German version should be adjusted by the contained instructions. The enclosed <skip> says, in effect, to ignore the wrapping <div> for purposes of alignment. The <rename> takes care of the apparent typographical error, and anchors the German version to the U.S. one. Note that our <rename> identifies the line by e, just as the German version does, but we could equally have used 5, or even the Roman numeral v, had we wished to. Every TAN file's numeration system is evaluated locally, independent of any external files. We need not reconcile the a, b, and c @n values in the German version, because these will be automatically treated as equivalent to 1, 2, and 3. The TAN format supports four numeration systems other than Arabic numerals: Roman numerals (uppercase or lowercase), alphabetic numerals (a, b, c, ..., z, aa, bb, ...), digit-alphabet combinations (e.g., 1a, 1e, 4g), and alphabet-digit combinations (e.g., a4, a5, b5). The last two systems are interpreted as two-tier numbering systems.
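The combined effect of <skip> and <rename> on the German references can be sketched in Python. This is a simplified model under stated assumptions, not TAN's own processing: the function names, the representation of each line as a path of (div type, @n) pairs, and the restriction to single-letter or Arabic @n values are all hypothetical.

```python
def n_value(n):
    """Convert an @n value to an integer (Arabic digits or a single letter only)."""
    if n.isdigit():
        return int(n)
    return ord(n.lower()) - ord("a") + 1


def adjust(divs, skip_types, rename_by):
    """Return one reference tuple per leaf div after applying the adjustments.

    divs       -- list of paths; each path is a list of (div_type, n) pairs
    skip_types -- div types whose level is dropped from the reference
    rename_by  -- maps an @n value (as an integer) to the offset to add
    """
    refs = []
    for path in divs:
        ref = []
        for div_type, n in path:
            if div_type in skip_types:
                continue  # <skip>: ignore this wrapping level
            value = n_value(n)
            ref.append(value + rename_by.get(value, 0))  # <rename>: shift by offset
        refs.append(tuple(ref))
    return refs


german = [
    [("Gedicht", "1"), ("Zeile", "a")],
    [("Gedicht", "1"), ("Zeile", "b")],
    [("Gedicht", "1"), ("Zeile", "c")],
    [("Gedicht", "1"), ("Zeile", "e")],
]

# e is the letter numeral for 5; shifting it by -1 yields 4.
print(adjust(german, skip_types={"Gedicht"}, rename_by={5: -1}))
```

After both adjustments the four German lines carry the flat references 1 through 4, which is what lets them line up with line-numbered English versions.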
The second major change, to address the German version's different value of <work>, is the addition of an <alias>, which allows us to assign one or more vocabulary items a common id. Wherever the value ring is used, it stands in for ger and eng-us, which point to two of the TAN-T files. Every TAN-T file has only one work and only one written source, so if you wish to make a claim about a particular work or source, you can use a TAN-T's id as a surrogate. If we make claims in our TAN-A file about a written source or a work, ring would assert the claim to be true for the works pointed to by the German and the U.S. versions. (We do not need to specifically mention eng-uk in the <alias>, since it has the same work IRI as the U.S. version does.)
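How an alias stands in for several ids can be shown with a minimal Python sketch. The function expand_idrefs and the dictionary representation of <alias> are hypothetical, introduced only to illustrate the substitution; nested aliases are not handled.

```python
def expand_idrefs(idrefs, aliases):
    """Expand a space-separated idrefs string into the ids it stands for.

    aliases maps an alias id to the list of ids given in its idrefs attribute.
    Order is preserved and duplicates are dropped.
    """
    resolved = []
    for token in idrefs.split():
        # An id that is not an alias simply stands for itself.
        for item in aliases.get(token, [token]):
            if item not in resolved:
                resolved.append(item)
    return resolved


aliases = {"ring": ["ger", "eng-us"]}
print(expand_idrefs("ring", aliases))
print(expand_idrefs("ring eng-uk", aliases))
```

A claim that refers to ring is thereby read as a claim about both ger and eng-us, without either id being repeated at every point of use.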
Note | |
---|---|
Alternatively, instead of |
The last major insertion is a new <change>, documenting when we made the alterations. Its @when effectively updates the version of our TAN-A file.
With these additions, the German version is now aligned with the other two. We could have made our work simpler just by directly modifying our local copy of the German version. But such changes would not have affected the master copy. What happens when the owner of the German file makes changes? At that point we would struggle to integrate the changes into our forked copy, and we would have to repeat that exercise every time the German file was updated. By keeping our local copy of the German file unchanged, and making simple adjustments in our TAN-A file, we can keep our local copy synchronized with the master file and yet make the adjustments needed to coordinate with ours.
The purpose statement in these guidelines says that TAN was "designed to maximize the syntactic and semantic interoperable alignment and exchange of texts, annotations, and language resources across projects." Here we see the importance of the qualifier "maximize." In no world will there ever be (nor should there be, it seems) a single, standard, canonical way to divide a given work. The TAN format does not change that reality. Rather, it provides a convergent ecosystem in which different practices can be easily reconciled, to help editors and authors enhance cross-project interoperability without artificially forcing conformity, or suppressing legitimately different outlooks.
Perhaps Hans Schmidt, the producer of the German version, can be contacted (e.g., through his tag URN). We do so, and we suggest that he modify the version to make it align better. Perhaps he has reasons for labeling the lines with letters, and perhaps he is reluctant to explicitly identify this poem with Ring around the Rosie. That is within his rights. But the conversation might lead to our pointing out that n="e" should probably be n="d" and that there is an apparent typographic error in the third line. Or perhaps we're the ones in error. (The original, printed book has the poem twice on page 438, one with the spelling "Holderbuch" at line 3, the other, "Holderbusch".) If Schmidt chooses to correct his master file, he can add a new <change>, and thereby tacitly notify anyone else using the file that corrections have been made.
At this point we have a network of six TAN files, five from our collection and one from outside. Although simple and small, this network could be extended to address some creative and complex research questions. Applications based on XSLT stylesheets could be used to automatically align the versions for reading and study, or to perform statistical analysis.
What you've read so far is only a cursory introduction to TAN features. Study the rest of these guidelines, as well as example TAN libraries, and you will find numerous ways to develop TAN files, and to use them to enhance your research, teaching, and writing.