Aligning across Projects

We now have a small corpus of TAN files. Let us imagine what it might be like to connect our TAN corpus to another. Let us assume that we have found elsewhere, in a German project, a TAN transcription of a work that looks quite similar to our own:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:hans@beispiel.com,2014:ringel">
   <head>
      <name>TAN Transkription, Ringelreihen mit Niederfallen</name>
      <master-location>http://beispiel.com/TAN-T/ringel.xml</master-location>
      <license>
         <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
         <name>Creative Commons Namensnennung 4.0 International Lizenz</name>
         <desc>Dieses Werk ist lizenziert unter einer Creative Commons Namensnennung 4.0
            International Lizenz.</desc>
      </license>
      <licensor who="schmidt"/>
      <source>
         <IRI>http://www.worldcat.org/oclc/4574384</IRI>
         <name>Franz Magnus Böhme, Deutsches Kinderlied und Kinderspiel: Volksüberlieferungen aus
            allen Landen deutscher Zunge, gesammelt, geordnet und mit Angabe der Quellen. Leipzig,
            1897.</name>
      </source>
      <definitions>
         <work>
            <IRI>tag:beispiel.com,2014:texte:holderbusch</IRI>
            <name>"Die Kinder auf dem Holderbusch"</name>
         </work>
         <version>
            <IRI>urn:uuid:31648039-3dbb-49b9-b66e-9bd2cd11630e</IRI>
            <name>zweite Version</name>
         </version>
         <div-type xml:id="Zeile">
            <IRI>http://dbpedia.org/resource/Gedichtzeile</IRI>
            <name>Gedichtzeile</name>
         </div-type>
         <person xml:id="schmidt" roles="Produzent">
            <IRI>tag:hans@beispiel.com,2014:selbst</IRI>
            <name xml:lang="eng">Hans Schmidt</name>
         </person>
         <role xml:id="Produzent">
            <IRI>http://schema.org/producer</IRI>
            <name xml:lang="deu">Produzent</name>
         </role>
         <ambiguous-letter-numerals-are-roman>false</ambiguous-letter-numerals-are-roman>
      </definitions>
      <alter>
         <normalization>
            <IRI>tag:kalvesmaki@gmail.com,2014:normalization:hyphens-discretionary-off</IRI>
            <name>Keine Bindestriche</name>
         </normalization>
      </alter>
      <resp who="schmidt" roles="Produzent"/>
      <change when="2014-08-13" who="schmidt">Anfang</change>
      <comment when="2014-08-13" who="schmidt">unten auf S. 438, rechts</comment>
   </head>
   <body xml:lang="deu" in-progress="false">
      <div type="Zeile" n="a">Ringel, Ringel, Reihe!</div>
      <div type="Zeile" n="b">Sind der Kinder dreie,</div>
      <div type="Zeile" n="c">Sitzen auf dem Holderbuch,</div>
      <div type="Zeile" n="e">Schreien alle: husch, husch, husch!</div>
   </body>
</TAN-T>

It seems clear to us that this 19th-century German version is quite similar to our two English versions, and several alignment options are open to us. Two more sets of word-for-word alignments would be interesting. But remember: finding a text that aligns nicely with others does not oblige us to align them, and choosing to align them does not oblige us to align everything. In this case, we set aside word-for-word alignments and focus only on the TAN-A-div alignment, so that we can, for example, later read the three versions in parallel and study their relationships.

To that end, we first observe some differences between this transcription and our other two. First, the value of <work> is not the one we have given our two versions. Second, the <div-type> is defined as http://dbpedia.org/resource/Gedichtzeile (Gedichtzeile = line of poetry). Third, the lines have been lettered instead of numbered (and they are stipulated to be alphabetic numerals, not Roman, through <ambiguous-letter-numerals-are-roman>). And last, the editor seems to have made a typographical error, making the last line n="e" instead of n="d". These four differences typify the inconsistencies commonly found in digital texts.

Note

There are a few other differences in this third transcription that do not affect our alignment. <version> distinguishes different versions of the same work found on the same text-bearing object. For example, if we are transcribing a bilingual edition, <version> specifies which of the two versions we are encoding. Notice that the <IRI> value is a UUID: in this case the editor was not prepared to deploy a formal IRI naming scheme (perhaps using a tag URN) that would be satisfactory for work-versions.

These are points we can easily reconcile in our TAN-A-div file, which we now expand to include the German version. We add a third <source>, an <alias>, an <alter>, and a second <change>:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A-div.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A-div.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-A-div xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring-alignment">
    <head>
       <name>div-based alignment of multiple versions of Ring o Roses</name>
       <master-location>ringoroses.div.1.xml</master-location>
       <license which="by_4.0"/>
       <licensor who="park"/>
       <source xml:id="eng-uk">
          <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
          <name>Transcription of ring around the roses in English (UK)</name>
          <location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml</location>
       </source>
       <source xml:id="eng-us">
          <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
          <name>Transcription of ring around the roses in English (US)</name>
          <location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml</location>
       </source>
       <source xml:id="ger">
          <IRI>tag:hans@beispiel.com,2014:ringel</IRI>
          <name>Transcription of an ancestor of Ring around the roses in German</name>
          <location when-accessed="2014-08-22">http://beispiel.com/TAN-T/ringel.xml</location>
          <location when-accessed="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml</location>
       </source>
       <definitions>
           <person xml:id="park">
               <IRI>tag:parkj@textalign.net,2015:self</IRI>
               <name xml:lang="eng">Jenny Park</name>
           </person>
           <role xml:id="creator">
               <IRI>http://schema.org/creator</IRI>
               <name xml:lang="eng">creator</name>
           </role>
           <alias id="ring" idrefs="ger eng-us"/>
       </definitions>
       <alter src="ger">
           <rename n="5" by="-1"/>
       </alter>
       <resp who="park" roles="creator"/>
       <change when="2014-08-14" who="park">Started file</change>
       <change when="2014-08-22" who="park">Added German version.</change>
    </head>
    <body/>
</TAN-A-div>

The first major change is the insertion of a third <source>, pointing to the new file and specifying its name and IRI. Note that two locations have been provided, one for the master location and another for the copy saved locally in our project folder. Validation is made against the first document available. If we wanted to work primarily from our local copy, we would have put that <location> first. By placing it second, we allow the validation engine to look for updates and changes in the master version. If that version is unavailable, validation is made against the second, local copy.
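
As a sketch of the alternative just mentioned, the same <source> could list the local copy first, so that validation prefers it and turns to the master location only if the local file is unavailable. The elements, attributes, and values are those of the listing above; only the order of the two <location> elements changes.

```xml
<source xml:id="ger">
   <!-- <IRI> and <name> unchanged from the listing above -->
   <!-- Local copy first: validation is made against this file when it is available -->
   <location when-accessed="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml</location>
   <!-- Master location second: consulted only if the local copy cannot be found -->
   <location when-accessed="2014-08-22">http://beispiel.com/TAN-T/ringel.xml</location>
</source>
```

The trade-off is that, with this order, updates to the master file go unnoticed for as long as the local copy is available.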

The second major change, which addresses the German version's different value of <work>, is the addition of an <alias>. If and when we make claims about a work in general, via @work, the id value ring will mean that we are asserting the claim to be true for any scriptum that shares the IRI values of the <work> in either the German or the US version. We do not need to mention eng-uk in the <alias>, because it already has a work IRI in common with the US version.

A <rename> takes care of the apparent typographical error, this time anchoring the German version to the US one. Note that the German version uses e, whereas we have used 5. We could just as well have used e, or even the Roman numeral v. Every TAN file's numeration system is evaluated locally, independent of any companion files. So we need not reconcile the a, b, and c in the @n values of the German version, because these will automatically be treated as equivalent to 1, 2, and 3. The TAN format allows four numeration systems other than Arabic numerals: Roman numerals (uppercase or lowercase), alphabetic numerals (a, b, c, ..., z, aa, bb, ...), digit-alphabet combinations (e.g., 1a, 1e, 4g), and alphabet-digit combinations (e.g., a4, a5, b5). The last two systems are treated as numerical pairs (1a as 1 and 1, 1e as 1 and 5, etc.).
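
The interchangeability of these numerals can be illustrated with the <alter> from the listing above. This is only a sketch based on the description in this section: the first <rename> is the one actually used, and the two commented-out forms are the alternatives the text says would have served equally well.

```xml
<alter src="ger">
   <!-- As in the listing above: the value of @n is written as an Arabic numeral -->
   <rename n="5" by="-1"/>
   <!-- Alternatives described in the text (only one <rename> would be used):
        <rename n="e" by="-1"/>   alphabetic numeral, e = 5
        <rename n="v" by="-1"/>   Roman numeral, v = 5
   -->
</alter>
```

In each case the effect is the same: the line labeled e in the German version is shifted back by one, so that it is treated as the fourth line.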

The last major insertion is a new <change>, documenting when we made the alterations. The value of @when effectively updates the version of our TAN-A-div file.

With these changes, the new version is aligned with the other two. Our work might have been simpler had we just modified the German version ourselves. But such changes would have affected only our local copy, not the master one, and changing only our local copy would not allow us to connect our work to other TAN files that may depend upon the same master file.

But perhaps Hans Schmidt, the producer of the German version, can be contacted. We do so, and we suggest that he modify the version to make it align better. In the case of <div-type>, he need merely add another element: <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI> (or even better, use the built-in TAN vocabulary). Perhaps he has reasons for labeling the lines with letters, and perhaps he is reluctant to explicitly identify this poem with Ring around the Rosie. That is within his rights. But the conversation might lead to our pointing out that n="e" should probably be n="d" and that there is an apparent discrepancy in the third line. (The original printed book has the poem twice on page 438, once with the spelling "Holderbuch" and once with "Holderbusch.") If Schmidt chooses to correct his master file, he can add a new <change>, and thereby tacitly notify anyone else using the file that corrections have been made.
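
Should Schmidt accept these suggestions, the revisions to his master file might look something like the following fragments. They are a sketch only, drawn from the elements already present in his transcription; the date and wording of the new <change> are hypothetical.

```xml
<!-- In <definitions>: a second <IRI> added to the existing <div-type> -->
<div-type xml:id="Zeile">
   <IRI>http://dbpedia.org/resource/Gedichtzeile</IRI>
   <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI>
   <name>Gedichtzeile</name>
</div-type>

<!-- In <body>: the last line relabeled from n="e" to n="d" -->
<div type="Zeile" n="d">Schreien alle: husch, husch, husch!</div>

<!-- In <head>: a new <change>; the date and the wording are hypothetical -->
<change when="2014-09-01" who="schmidt">Zeilennummer korrigiert</change>
```

If the master file were corrected in this way, the <rename> in our TAN-A-div file would presumably no longer be needed, since the German version would no longer have a line labeled e.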

At this point we have a network of five TAN files, four in our corpus and one from outside. Although simple, the network could be the basis for some creative and complex research questions. Stylesheets could be used to automatically align the versions for reading and study, or to perform statistical analysis. Study of the rest of these guidelines, as well as example TAN libraries, will suggest numerous ways to create, manage, share, and use TAN files.