Aligning across projects

We now have a collection of five TAN files: two TAN-T transcriptions, a TAN-A alignment/annotation file, a TAN-A-tok word-for-word alignment file, and a TAN-voc file for vocabulary shared across the files.

Let us imagine what it might be like to connect our TAN collection to a TAN file made by someone else. Let us assume that we have found elsewhere, in a German project, a TAN transcription of a work that looks quite similar to our own:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" 
   type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN.sch" 
   type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:hans@beispiel.com,2014:ringel">
   <head>
      <name>TAN Transkription, Ringelreihen mit Niederfallen</name>
      <master-location>http://beispiel.com/TAN-T/ringel.xml</master-location>
      <license>
         <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
         <name>Creative Commons Namensnennung 4.0 International Lizenz</name>
         <desc>Dieses Werk ist lizenziert unter einer Creative Commons Namensnennung 4.0
            International Lizenz.</desc>
      </license>
      <licensor who="schmidt"/>
      <work>
         <IRI>tag:beispiel.com,2014:texte:holderbusch</IRI>
         <name>"Die Kinder auf dem Holderbusch"</name>
      </work>
      <version>
         <IRI>urn:uuid:31648039-3dbb-49b9-b66e-9bd2cd11630e</IRI>
         <name>zweite Version</name>
      </version>
      <numerals priority="letters"/>
      <source>
         <IRI>http://www.worldcat.org/oclc/4574384</IRI>
         <name>Franz Magnus Böhme, Deutsches Kinderlied und Kinderspiel: Volksüberlieferungen 
            aus allen Landen deutscher Zunge, gesammelt, geordnet und mit Angabe der Quellen. 
            Leipzig, 1897.</name>
      </source>
      <adjustments>
         <normalization>
            <IRI>tag:kalvesmaki@gmail.com,2014:normalization:hyphens-discretionary-off</IRI>
            <name>Keine Bindestriche</name>
         </normalization>
      </adjustments>
      <vocabulary-key>
         <div-type xml:id="Zeile">
            <IRI>http://dbpedia.org/resource/Gedichtzeile</IRI>
            <name>Gedichtzeile</name>
         </div-type>
         <div-type which="poem" xml:id="Gedicht"/>
         <person xml:id="schmidt" roles="Produzent">
            <IRI>tag:hans@beispiel.com,2014:selbst</IRI>
            <name xml:lang="eng">Hans Schmidt</name>
         </person>
         <role xml:id="Produzent">
            <IRI>http://schema.org/producer</IRI>
            <name xml:lang="eng">Produzent</name>
         </role>
      </vocabulary-key>
      <file-resp who="schmidt"/>
      <resp who="schmidt" roles="Produzent"/>
      <change when="2014-08-13" who="schmidt">Anfang</change>
      <comment when="2014-08-13" who="schmidt">unten auf der Z. 438, recht</comment>
      <to-do/>
   </head>
   <body xml:lang="deu">
      <div type="Gedicht" n="1">
         <div type="Zeile" n="a">Ringel, Ringel, Reihe!</div>
         <div type="Zeile" n="b">Sind der Kinder dreie,</div>
         <div type="Zeile" n="c">Sitzen auf dem Holderbuch,</div>
         <div type="Zeile" n="e">Schreien alle: husch, husch, husch!</div>
      </div>
   </body>
</TAN-T>

It seems that this 19th-century German version is quite similar to our two English versions, so several alignment options are open to us. Two more sets of word-for-word alignments would be interesting. But remember: finding a text that aligns nicely with others does not oblige us to align them, nor must any given alignment cover everything. In this case, we set word-for-word alignment aside and focus only on the TAN-A alignment, so that, for example, we can use the built-in TAN application to display the three versions in parallel, as a reading tool for studying intertextual relationships more closely.

To that end, we first observe some differences between this transcription and our other two. First, the value of <work> is not the one we have given our two versions. Second, <numerals> specifies through its @priority value that any ambiguous numerals should be interpreted as letter numerals, not Roman ones (that's important, e.g., for a <div> with an @n value of c, which could mean 3 [a, b, c, ...] or the Roman numeral for 100). Third, the lines are wrapped in a <div> for the whole poem (Gedicht). Fourth, the lines have been lettered instead of numbered. And last, the editor seems to have made a typographical error, making the last line e instead of the expected d. These five differences typify the inconsistencies one commonly finds in digital texts of the same work from different projects.^[8]

These are points we can easily reconcile in our TAN-A file, which we now expand to include the German version. The additions are a third <source>, an <adjustments> element, an <alias>, and a new <change>:

<TAN-A xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
    id="tag:parkj@textalign.net,2015:ring-alignment">
    <head>
       <name>div-based alignment of multiple versions of Ring o Roses</name>
       <master-location href="http://textalign.net/release/TAN-2020/examples/TAN-A/ringoroses.div.1.xml"/>
       <license which="by_4.0" licensor="park"/>
       <source xml:id="eng-uk">
          <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
          <name>Transcription of ring around the roses in English (UK)</name>
          <location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/>
       </source>
       <source xml:id="eng-us">
          <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
          <name>Transcription of ring around the roses in English (US)</name>
          <location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/>
       </source>
       <source xml:id="ger">
          <IRI>tag:hans@beispiel.com,2014:ringel</IRI>
          <name>Transcription of an ancestor of Ring around the roses in German</name>
          <location accessed-when="2014-08-22">http://beispiel.com/TAN-T/ringel.xml</location>
          <location accessed-when="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml</location>
       </source>
       <adjustments src="ger">
          <skip div-type="Gedicht"/>
          <rename n="e" by="-1"/>
       </adjustments>
       <vocabulary-key>
          <person xml:id="park" which="Jenny Park"/>
          <alias id="ring" idrefs="ger eng-us"/>
       </vocabulary-key>
       <resp who="park" roles="creator"/>
       <change when="2014-08-14" who="park">Started file</change>
       <change when="2014-08-22" who="park">Added German version.</change>
       <to-do>
          <comment when="2018-08-09-04:00" who="park">Finish file.</comment>
       </to-do>
    </head>
    . . . . . .
</TAN-A>

The first major change is the insertion of a third <source>, pointing to the new file and specifying its name and IRI. Note that two <location>s have been provided, one for the original and another for a local copy we have saved. Validation takes into account only the first document available. If we wanted to work primarily off our local copy, we would have put that <location> first. By placing it second, we allow the validation engine to work primarily off the master version and therefore to look for updates and changes. If that version is unavailable, validation will be made against the second, local copy.
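As a minimal sketch of the alternative just described, the same <source> could list the local copy first, so that validation would work primarily off it. The two <location> elements are copied from the listing above; only their order differs.

```xml
<source xml:id="ger">
   <!-- <IRI> and <name> unchanged from the listing above -->
   <!-- Local copy first: validation would use it whenever it is available -->
   <location accessed-when="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml</location>
   <!-- Master copy second: consulted only if the local copy cannot be found -->
   <location accessed-when="2014-08-22">http://beispiel.com/TAN-T/ringel.xml</location>
</source>
```

The trade-off is that changes to the master file would then go unnoticed for as long as the local copy is found first.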

<adjustments> specifies through its @src that only the German version should be adjusted by the contained instructions. The enclosed <skip> says, in effect, to ignore the wrapping <div> for purposes of alignment. The <rename> takes care of the apparent typographical error by shifting the @n value of the last line down by one, which anchors the German version to the U.S. one. Note that we have referred to that line as e, just as the German version does, but we could equally have used 5, or even the Roman numeral v, had we wished to. Every TAN file's numeration system is evaluated locally, independent of any external files. We need not reconcile the a, b, and c @n values in the German version, because these will be automatically treated as equivalent to 1, 2, and 3. The TAN format supports four numeration systems other than Arabic numerals: Roman numerals (uppercase or lowercase), alphabetic numerals (a, b, c, ..., z, aa, bb, ...), digit-alphabet combinations (e.g., 1a, 1e, 4g), and alphabet-digit combinations (e.g., a4, a5, b5). The last two are each interpreted as a two-tier numbering system.
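To illustrate the point about numeration, the following sketch sets the three spellings mentioned above side by side. It assumes, as the paragraph states, that e, 5, and v all resolve to the same value in our TAN-A file; only one form would appear in a real file, so the alternatives are commented out.

```xml
<adjustments src="ger">
   <skip div-type="Gedicht"/>
   <!-- Any ONE of the following forms should pick out the same line -->
   <rename n="e" by="-1"/>           <!-- alphabetic numeral -->
   <!-- <rename n="5" by="-1"/> -->  <!-- Arabic numeral -->
   <!-- <rename n="v" by="-1"/> -->  <!-- Roman numeral -->
</adjustments>
```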

The second major change, to address the German version's different value of <work>, is the addition of an <alias>, which allows us to assign one or more vocabulary items a common id. Wherever the value ring is used, it stands in for ger and eng-us, which point to the two TAN-T files. You may be familiar with this concept from critical editions, where a siglum, e.g., A might stand for several other sigla, e.g., a, b, and c. So every time you see something said about A, you know that by implication it is true of a, b, and c.

Every TAN-T file has only one work and only one written source. So if you wish to make a claim about a particular work or source, you can use a TAN-T's id as a surrogate. That is, the @xml:id in <source> can stand in to represent either the work or the book or manuscript from which the text has been taken. So if we make claims in our TAN-A file about a written source or a work, using ring would assert the claim to be true for the works pointed to by the German and the U.S. versions. (We do not need to mention eng-uk specifically in the <alias>, since it has the same work IRI as the U.S. version does.)^[9]

The last major insertion is a new <change>, documenting when we made the alterations. Its @when effectively updates the version of our TAN-A file.

With these additions, the German version is now aligned with the other two. We could have made our work simpler just by directly modifying our local copy of the German version. But such a change would not have affected the master copy. What happens when the owner of the German file makes changes? At that point we would be faced with a version conflict: changes in the original, and our own changes in the copy. We would struggle to reconcile the differences, and we would have to repeat that exercise every time the German file was updated. By keeping our local copy of the German file unchanged, and making simple adjustments in our TAN-A file, we can keep our local copy synchronized with the master file and yet make the adjustments needed to coordinate it with ours.

The purpose statement in these guidelines says that TAN was "designed to maximize the syntactic and semantic interoperability of texts, annotations, and language resources." Here we see the importance of the qualifier "maximize." In no world will there ever be (nor should there be, it seems) a single, indisputable way to divide a given work. The TAN format does not change that reality. Rather, it provides a convergent ecosystem in which different practices can be easily reconciled, to help editors and authors enhance cross-project interoperability without artificially forcing conformity, or suppressing legitimately different outlooks.

Perhaps Hans Schmidt, the producer of the German version, can be contacted (e.g., through his tag URN). We do so, and we suggest that he modify the version to make it align better. Perhaps he has reasons for labeling the lines with letters, and perhaps he is reluctant to explicitly identify this poem with Ring around the Rosie. That is within his rights. But the conversation might lead to our pointing out that n="e" should probably be n="d" and that there is an apparent typographic error in the third line ("Holderbuch" for "Holderbusch"). Or perhaps we're the ones in error. (The original, printed book has the poem twice on page 438, once with the spelling "Holderbuch" at line 3, and once with "Holderbusch".) If Schmidt chooses to correct his master file, he can add a new <change>, and thereby tacitly notify anyone else using the file that corrections have been made.
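If Schmidt accepted the first suggestion, his correction might look something like the following sketch. The date and the wording of the new <change> are hypothetical, invented here for illustration; everything else is taken from his file as shown earlier.

```xml
<!-- In <head>, after the existing <change>; date and text are hypothetical -->
<!-- (German: "Last line corrected from e to d") -->
<change when="2014-09-01" who="schmidt">Letzte Zeile von e zu d korrigiert</change>

<!-- In <body>, the last line relabeled -->
<div type="Zeile" n="d">Schreien alle: husch, husch, husch!</div>
```

Once such a correction reached the master file, the <rename> in our TAN-A file would presumably no longer be needed and could be removed.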

At this point we have a network of six TAN files, five from our collection and one from outside. Although simple and small, this network could be extended to address some creative and complex research questions. Applications based on XSLT stylesheets could be used to automatically align the versions for reading and study, or to perform statistical analysis.

What you've read so far is only a cursory introduction to TAN features. Study the rest of these guidelines, as well as example TAN libraries, and you will find numerous ways to develop TAN files, and to use them to enhance your research, teaching, and writing.

^[8]There are a few other differences in this third transcription that do not affect our alignment. <version> is used to distinguish different versions of the same work found on the same text-bearing object. That is, if we are transcribing a bilingual edition, we can use <version> to specify which of the two versions we are encoding. Notice that the <IRI> value is a UUID. In this case the editor was not prepared to deploy a formal IRI naming scheme (perhaps using a tag URN) that would be satisfactory for work-versions. Also, the <div-type> is defined as http://dbpedia.org/resource/Gedichtzeile (Gedichtzeile = line of poetry), so it doesn't intersect with our IRIs for the vocabulary item line. But <div-type> is not used to align versions, and validation isn't affected, so we do not concern ourselves here with trying to reconcile the different IRIs.

^[9]Alternatively, instead of using <alias>, we could simply have adjusted our TAN-voc file, adding the German version's work <IRI> value to the appropriate vocabulary item, and then used that item's id.
