We now have a collection of five TAN files: two TAN-T transcriptions, a TAN-A alignment/annotation file, a TAN-A-tok word-for-word alignment file, and a TAN-voc file for vocabulary shared across the files.
Let us imagine what it might be like to connect our TAN collection to a TAN file made by someone else. Let us assume that we have found elsewhere, in a German project, a TAN transcription of a work that looks quite similar to our own:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:hans@beispiel.com,2014:ringel">
    <head>
        <name>TAN Transkription, Ringelreihen mit Riederfallen</name>
        <master-location>http://beispiel.com/TAN-T/ringel.xml</master-location>
        <license>
            <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
            <name>Creative Commons Namensnennung 4.0 International Lizenz</name>
            <desc>Dieses Werk ist lizenziert unter einer Creative Commons Namensnennung 4.0 International Lizenz.</desc>
        </license>
        <licensor who="schmidt"/>
        <work>
            <IRI>tag:beispiel.com,2014:texte:holderbusch</IRI>
            <name>"Die Kinder auf dem Holderbusch"</name>
        </work>
        <version>
            <IRI>urn:uuid:31648039-3dbb-49b9-b66e-9bd2cd11630e</IRI>
            <name>zweite Version</name>
        </version>
        <numerals priority="letters"/>
        <source>
            <IRI>http://www.worldcat.org/oclc/4574384</IRI>
            <name>Franz Magnus Böhme, Deutsches Kinderlied und Kinderspiel: Volksüberlieferungen aus allen Landen deutscher Zunge, gesammelt, geordnet und mit Angabe der Quellen. Leipzig, 1897.</name>
        </source>
        <adjustments>
            <normalization>
                <IRI>tag:kalvesmaki@gmail.com,2014:normalization:hyphens-discretionary-off</IRI>
                <name>Keine Bindestriche</name>
            </normalization>
        </adjustments>
        <vocabulary-key>
            <div-type xml:id="Zeile">
                <IRI>http://dbpedia.org/resource/Gedichtzeile</IRI>
                <name>Gedichtzeile</name>
            </div-type>
            <div-type which="poem" xml:id="Gedicht"/>
            <person xml:id="schmidt" roles="Produzent">
                <IRI>tag:hans@beispiel.com,2014:selbst</IRI>
                <name xml:lang="eng">Hans Schmidt</name>
            </person>
            <role xml:id="Produzent">
                <IRI>http://schema.org/producer</IRI>
                <name xml:lang="eng">Produzent</name>
            </role>
        </vocabulary-key>
        <file-resp who="schmidt"/>
        <resp who="schmidt" roles="Produzent"/>
        <change when="2014-08-13" who="schmidt">Anfang</change>
        <comment when="2014-08-13" who="schmidt">unten auf der Z. 438, recht</comment>
        <to-do/>
    </head>
    <body xml:lang="deu">
        <div type="Gedicht" n="1">
            <div type="Zeile" n="a">Ringel, Ringel, Reihe!</div>
            <div type="Zeile" n="b">Sind der Kinder dreie,</div>
            <div type="Zeile" n="c">Sitzen auf dem Holderbuch,</div>
            <div type="Zeile" n="e">Schreien alle: husch, husch, husch!</div>
        </div>
    </body>
</TAN-T>
It seems that this 19th-century German version is quite similar to our two English versions. We have some alignment options open to us. Two more sets of word-for-word alignments would be interesting, but remember: just because we find a text that aligns nicely with others does not mean that we must align them, or that for a given alignment we must align everything. In this case, we choose not to worry about word-for-word alignments, and we focus only on the TAN-A alignment, so that, for example, we can use the built-in TAN application to display the three versions in parallel, as a reading tool for studying intertextual relationships more closely.
To that end, we first observe some differences between this transcription and our other two. First, the value of <work> is not the one we have given our two versions. Second, <numerals> specifies by its value for @priority that any ambiguous numerals should be interpreted as letter numerals, not Roman (that's important, e.g., for a <div> with an @n value c, which could mean 3 [a, b, c, ...] or the Roman numeral for 100). Next, the lines are wrapped in a <div> for the whole poem (Gedicht) and they have been lettered instead of numbered. And last, the editor seems to have made a typographical error, making the last line e instead of the expected d. These five differences typify inconsistencies one commonly finds in digital texts from different projects of the same work.[8]
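The ambiguity that @priority settles can be illustrated with a small Python sketch. This is not TAN's own validation code, which is not shown here. The function names are hypothetical, and the reading of doubled letters (aa = 27, bb = 28) is an assumption.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
# Strict pattern for well-formed Roman numerals (hypothetical helper, not from TAN).
ROMAN_RE = re.compile(r"^m{0,4}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$")


def roman_to_int(s):
    """Return the value of a Roman numeral, or None if s is not one."""
    s = s.lower()
    if not s or not ROMAN_RE.match(s):
        return None
    total = 0
    for ch, nxt in zip(s, s[1:] + " "):
        value = ROMAN_VALUES[ch]
        # A smaller value before a larger one is subtracted (iv = 4).
        total += -value if ROMAN_VALUES.get(nxt, 0) > value else value
    return total


def letter_to_int(s):
    """Return a=1 ... z=26, aa=27, bb=28 (assumed); None for anything else."""
    s = s.lower()
    if not s or not (s.isascii() and s.isalpha()) or len(set(s)) != 1:
        return None
    return (len(s) - 1) * 26 + (ord(s[0]) - ord("a") + 1)


def resolve_numeral(n, priority):
    """Interpret an @n value; priority is 'letters' or 'roman'."""
    if n.isdigit():
        return int(n)
    as_letter, as_roman = letter_to_int(n), roman_to_int(n)
    first, second = (as_letter, as_roman) if priority == "letters" else (as_roman, as_letter)
    return first if first is not None else second


print(resolve_numeral("c", "letters"))  # 3
print(resolve_numeral("c", "roman"))    # 100
```

Only values that are valid in both systems, such as c or i, depend on the priority. A value like e can only be a letter numeral, so it resolves to 5 either way.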
These are points we can easily reconcile in our TAN-A file, which we now expand to include the German version. We make the following adjustments (emphasized):
<TAN-A xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:parkj@textalign.net,2015:ring-alignment">
    <head>
        <name>div-based alignment of multiple versions of Ring o Roses</name>
        <master-location href="http://textalign.net/release/TAN-2020/examples/TAN-A/ringoroses.div.1.xml"/>
        <license which="by_4.0" licensor="park"/>
        <source xml:id="eng-uk">
            <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
            <name>Transcription of ring around the roses in English (UK)</name>
            <location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/>
        </source>
        <source xml:id="eng-us">
            <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
            <name>Transcription of ring around the roses in English (US)</name>
            <location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/>
        </source>
        <source xml:id="ger">
            <IRI>tag:beispiel.com,2014:ringel</IRI>
            <name>Transcription of an ancestor of Ring around the roses in German</name>
            <location accessed-when="2014-08-22">http://beispiel.com/TAN-T/ringel.xml</location>
            <location accessed-when="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml</location>
        </source>
        <adjustments src="ger">
            <skip div-type="Gedicht"/>
            <rename n="e" by="-1"/>
        </adjustments>
        <vocabulary-key>
            <person xml:id="park" which="Jenny Park"/>
            <alias id="ring" idrefs="ger eng-us"/>
        </vocabulary-key>
        <resp who="park" roles="creator"/>
        <change when="2014-08-14" who="park">Started file</change>
        <change when="2014-08-22" who="park">Added German version.</change>
        <to-do>
            <comment when="2018-08-09-04:00" who="park">Finish file.</comment>
        </to-do>
    </head>
    . . . . . .
</TAN-A>
The first major change is the insertion of a third <source>, pointing to the new file and specifying its name and IRI. Note that two <location>s have been provided, one for the original and another for a local copy we have saved. Validation will take into account only the first document available. If we wanted to work primarily off our local copy, we would have put that <location> first. By placing it second, we allow the validation engine to work primarily off the master version and therefore look for updates and changes. If that version is unavailable, validation will be made against the second, local copy.
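The "first document available" rule can be pictured with a short Python sketch. The function name and the availability check are hypothetical stand-ins. How a TAN validator actually probes a location is not described here.

```python
def first_available(locations, is_available):
    """Return the first location that can be retrieved, in document order."""
    for href in locations:
        if is_available(href):
            return href
    return None  # no copy could be reached


# The two <location> values of the German source, in the order given above.
locations = [
    "http://beispiel.com/TAN-T/ringel.xml",
    "../TAN-T/ring-o-roses.deu.1897.xml",
]

# Online: the master copy is checked first, and it wins.
print(first_available(locations, lambda href: True))

# Offline (simulated): only non-HTTP paths are reachable, so the local copy is used.
print(first_available(locations, lambda href: not href.startswith("http")))
```

Because the order of the <location> elements decides the outcome, swapping them would make the local copy the primary one.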
<adjustments> specifies through its @src that only the German version should be adjusted by the contained instructions. The enclosed <skip> says, in effect, to ignore the wrapping <div> for purposes of alignment. The <rename> takes care of the apparent typographical error, and anchors the German version to the U.S. one. Note that our <rename> identifies the line by e, as the German version does, but we could just as well have used 5, or even the Roman numeral v, had we wished to. Every TAN file's numeration system is evaluated locally, independent of any external files.
We need not reconcile the a, b, and c @n values in the German version, because these will be automatically treated as equivalent to 1, 2, and 3. The TAN format supports four numeration systems other than Arabic numerals: Roman numerals (uppercase or lowercase), alphabetic numerals (a, b, c, ..., z, aa, bb, ...), and digit-alphabet combinations (e.g., 1a, 1e, 4g) or alphabet-digit combinations (e.g., a4, a5, b5). The last two systems are interpreted as a two-tier numbering system.
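The combined effect of <skip>, <rename>, and the automatic letter-to-number equivalence can be sketched in Python. This is an illustration only, and it rests on two assumptions. The first is that by="-1" decrements the numeric value of the matched @n. The second is that the German file's letters are read as letter numerals, as its <numerals> declares. The data structures and function names are hypothetical.

```python
def to_int(n):
    """Read an @n value as a number; single letters count as a=1, b=2, ..."""
    if n.isdigit():
        return int(n)
    if len(n) == 1 and "a" <= n <= "z":
        return ord(n) - ord("a") + 1
    raise ValueError(f"unsupported numeral: {n!r}")


def adjust(ref, skip_types, rename_by):
    """Drop skipped div types, then shift renamed @n values by the given amount.

    ref is a list of (div type, @n) pairs, outermost div first.
    """
    adjusted = []
    for div_type, n in ref:
        if div_type in skip_types:
            continue  # <skip div-type="Gedicht"/>: ignore the wrapping div
        adjusted.append(to_int(n) + rename_by.get(n, 0))
    return adjusted


# The four lines of the German transcription, as (div type, @n) paths.
german_refs = [
    [("Gedicht", "1"), ("Zeile", "a")],
    [("Gedicht", "1"), ("Zeile", "b")],
    [("Gedicht", "1"), ("Zeile", "c")],
    [("Gedicht", "1"), ("Zeile", "e")],
]

adjusted = [adjust(ref, {"Gedicht"}, {"e": -1}) for ref in german_refs]
print(adjusted)  # [[1], [2], [3], [4]]
```

After adjustment, the German lines carry the references 1 through 4.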
The second major change, to address the German version's different value of <work>, is the addition of an <alias>, which allows us to assign one or more vocabulary items a common id. Wherever the value ring is used, it stands in for ger and eng-us, which point to the two TAN-T files. You may be familiar with this concept from critical editions, where a siglum, e.g., A might stand for several other sigla, e.g., a, b, and c. So every time you see something said about A, you know that by implication it is true of a, b, and c.
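The siglum analogy amounts to a simple substitution, shown here as a minimal Python sketch. The function name is hypothetical, and the sketch expands only one level of aliasing. Whether and how TAN resolves aliases that point to other aliases is not covered here.

```python
# The <alias> from the TAN-A file above: ring stands for ger and eng-us.
aliases = {"ring": ["ger", "eng-us"]}


def expand_idrefs(idrefs, aliases):
    """Replace each alias in a space-separated idrefs string with its members."""
    expanded = []
    for ref in idrefs.split():
        for target in aliases.get(ref, [ref]):
            if target not in expanded:  # keep order, drop duplicates
                expanded.append(target)
    return expanded


print(expand_idrefs("ring", aliases))         # ['ger', 'eng-us']
print(expand_idrefs("ring eng-uk", aliases))  # ['ger', 'eng-us', 'eng-uk']
```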
Every TAN-T file has only one work and only one written source. So if you wish to make a claim about a particular work or source, you can use a TAN-T's id as a surrogate. That is, the @xml:id in <source> can stand in to represent either the work or the book or manuscript from which the text has been taken. So if we make claims in our TAN-A file about a written source or a work, ring would assert the claim to be true for the works pointed to by the German and the U.S. versions. (We do not need to specifically mention eng-uk in the <alias>, since it has the same work IRI as the U.S. version does.) [9]
The last major insertion is a new <change>, documenting when we made the alterations. Its @when effectively updates the version of our TAN-A file.
With these additions, the German version is now aligned with the other two. We could have made our work simpler just by directly modifying our local copy of the German version. But such a change would not have affected the master copy. What happens when the owner of the German file makes changes? At that point we would be faced with a version conflict: changes in the original, and our own changes in the copy. We would struggle to reconcile the differences. And we would have to repeat that exercise every time the German file was updated. By keeping our local copy of the German file unchanged, and making simple adjustments in our TAN-A file, we can keep our local copy synchronized with the master file and yet make the adjustments needed to coordinate with ours.
The purpose statement in these guidelines says that TAN was "designed to maximize the syntactic and semantic interoperability of texts, annotations, and language resources." Here we see the importance of the qualifier "maximize." In no world will there ever be (nor should there be, it seems) a single, indisputable way to divide a given work. The TAN format does not change that reality. Rather, it provides a convergent ecosystem in which different practices can be easily reconciled, to help editors and authors enhance cross-project interoperability without artificially forcing conformity, or suppressing legitimately different outlooks.
Perhaps Hans Schmidt, the producer of the German version, can be contacted (e.g., through his tag URN). We do so, and we suggest that he modify the version to make it align better. Perhaps he has reasons for labeling the lines with letters, and perhaps he is reluctant to explicitly identify this poem with Ring around the Rosie. That is within his rights. But the conversation might lead to our pointing out that n="e" should probably be n="d" and that there is an apparent typographic error in the third line. Or perhaps we're the ones in error. (The original, printed book has the poem twice on page 438, once with the spelling "Holderbuch" at line 3 and once with "Holderbusch".) If Schmidt chooses to correct his master file, he can add a new <change>, and thereby tacitly notify anyone else using the file that corrections have been made.
At this point we have a network of six TAN files, five from our collection and one from outside. Although simple and small, this network could be extended to address some creative and complex research questions. Applications based on XSLT stylesheets could be used to automatically align the versions for reading and study, or to perform statistical analysis.
What you've read so far is only a cursory introduction to TAN features. Study the rest of these guidelines, as well as example TAN libraries, and you will find numerous ways to develop TAN files, and to use them to enhance your research, teaching, and writing.
[8] There are a few other differences in this third transcription that do not affect our alignment. <version> is used to distinguish different versions of the same work found on the same text-bearing object. That is, if we are transcribing a bilingual edition, we can use <version> to specify which of the two versions we are encoding. Notice that the <IRI> value is a UUID. In this case the editor was not prepared to deploy a formal IRI naming scheme (perhaps using a tag URN) that would be satisfactory for work-versions. Also, the <div-type> is defined as http://dbpedia.org/resource/Gedichtzeile (Gedichtzeile = line of poetry), so it doesn't intersect with our IRIs for the vocabulary item line. But <div-type> is not used to align versions, and validation isn't affected, so we do not concern ourselves here with trying to reconcile the different IRIs.