Aligning across Projects

We now have a small, tightly knit corpus of TAN files. Let us imagine what it might be like to connect our TAN corpus to another. Let us assume that we have found in a German project a TAN transcription of a work that looks quite similar to our own:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="" type="application/relax-ng-compact-syntax"?>
<?xml-model href="" type="application/xml" schematypens=""?>
<TAN-T xmlns=",2015:ns" id=",2014:ringel">
      <name>TAN Transkription, Ringelreihen mit Riederfallen</name>
      <rights-excluding-sources rights-holder="schmidt">
         <name>Creative Commons Namensnennung 4.0 International Lizenz.</name>
         <desc>Dieses Werk ist lizenziert unter einer Creative Commons 
            Namensnennung 4.0 International Lizenz.</desc>
         <name>Franz Magnus Böhme, Deutsches Kinderlied und Kinderspiel: Volksüberlieferungen aus
            allen Landen deutscher Zunge, gesammelt, geordnet und mit Angabe der Quellen. Leipzig,
            <name>"Die Kinder auf dem Holderbusch"</name>
            <name>zweite Version</name>
         <div-type xml:id="Zeile">
               <name>Keine Bindestriche</name>
      <agent xml:id="schmidt" roles="Produzent">
         <name xml:lang="eng">Hans Schmidt</name>
      <role xml:id="Produzent">
         <name xml:lang="eng">Produzent</name>
      <change when="2014-08-13" who="schmidt">Anfang</change>
      <comment when="2014-08-13" who="schmidt">unten auf der Z. 438, recht</comment>
   <body xml:lang="deu" in-progress="false">
      <div type="Zeile" n="a">Ringel, Ringel, Reihe!</div>
      <div type="Zeile" n="b">Sind der Kinder dreie,</div>
      <div type="Zeile" n="c">Sitzen auf dem Holderbuch,</div>
      <div type="Zeile" n="e">Schreien alle: husch, husch, husch!</div>

It seems clear to us that this 19th-century German version is quite similar to our two English versions. We have some alignment options open to us. Two more sets of word-for-word alignments would be interesting, but remember: finding a text that aligns nicely with others does not oblige us to align them, nor, if we do choose to make an alignment, to align everything. In this case, we choose not to worry about word-for-word alignments, and we focus only on the TAN-A-div alignment, so that, for example, we can later generate an HTML report that lets us conveniently read the three versions in parallel and study their relationships.

To that end, we first observe some differences between this transcription and our other two. First, the value of <work> is not the one we have given our two versions. Second, the <div-type> is defined as (Gedichtzeile = line of poetry). Third, the lines have been lettered instead of numbered. And last, the editor seems to have made a typographical error, labeling the last line n="e" instead of n="d". These four differences typify some of the inconsistencies that are commonly found in digital texts.


There are a few other differences in this third transcription that do not affect our alignment. <version> is used to distinguish different versions of the same work found on the same text-bearing object. That is, if we are transcribing a bilingual edition, we can use <version> to specify which of the two versions we are encoding. Notice that the <IRI> value is a UUID. In this case the editor was not prepared to deploy a formal IRI naming scheme (perhaps one using a tag URN) that would be satisfactory for work-versions.

These are points we can easily reconcile in our TAN-A-div file, which we now expand to include the German version. We make the following adjustments (in boldface):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="" type="application/relax-ng-compact-syntax"?>
<?xml-model href="" type="application/xml" schematypens=""?>
<TAN-A-div xmlns=",2015:ns" id=",2015:ring-alignment">
       <name>div-based alignment of multiple versions of Ring o Roses</name>
       <rights-excluding-sources which="by-nc-nd_4.0" rights-holder="park"/>
       <source xml:id="eng-uk">
          <name>Transcription of ring around the roses in English (UK)</name>
          <location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml</location>
       <source xml:id="eng-us">
          <name>Transcription of ring around the roses in English (US)</name>
          <location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml</location>
       <source xml:id="ger">
          <name>Transcription of an ancestor of Ring around the roses in German</name>
          <location when-accessed="2014-08-22"></location>
          <location when-accessed="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml</location>
       <agent xml:id="park" roles="creator">
          <name xml:lang="eng">Jenny Park</name>
       <role xml:id="creator" which="creator"/>
       <change when="2014-08-14" who="park">Started file</change>
       <change when="2014-08-22" who="park">Added German version.</change>
       <equate-works src="eng-uk ger"/>
          <div-type-ref src="ger" div-type-ref="Zeile"/>
          <div-type-ref src="eng-uk" div-type-ref="line"/>
          <anchor-div-ref source="ger" ref="5"/>
          <div-ref source="eng-us" ref="4"/>

The first major change is the insertion of a new <source>, identifying the name and location of the third example. Note that two locations have been provided, one for the original location and another for the copy saved locally into our project folder. Validation is performed against the first document available. If we wanted to work primarily from our local copy, we would have put it first. By placing it second, we allow the validation engine to look for updates and changes in the master version. If that version is unavailable, validation will be made against the second, local copy.
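The fallback order described above can be illustrated with a minimal Python sketch. It is not the TAN validation engine: the function name first_available and the is_available check are hypothetical, and the sketch assumes that "available" simply means the check returns true for a given location (by default, that a local file exists).

```python
import os
from typing import Callable, Iterable, Optional


def first_available(
    locations: Iterable[str],
    is_available: Callable[[str], bool] = os.path.isfile,
) -> Optional[str]:
    """Return the first location, in document order, that can be reached.

    `is_available` is a stand-in for whatever check a real processor
    performs (assumption: by default, "the local file exists").
    """
    for location in locations:
        # Skip empty entries, then take the first location that responds.
        if location and is_available(location):
            return location
    return None  # nothing could be reached


# The master copy is listed first, the local copy second.
locations = ["master.xml", "../TAN-T/ring-o-roses.deu.1897.xml"]

# Simulate the master being unreachable: the local copy is chosen.
chosen = first_available(locations, lambda loc: loc != "master.xml")
print(chosen)
```

Here "master.xml" is only a placeholder for the master location, which is left unspecified in the example file. Reversing the order of the list would make the local copy the preferred document.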

The second major insertion is a new <change>, documenting when we made the alterations. The value of @when effectively updates the version of our TAN-A-div file.
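As a rough illustration of how @when can serve as a version marker, the following Python sketch picks the most recent date from a list of <change> dates. The function name latest_change is hypothetical, and the sketch assumes the values are plain ISO dates (YYYY-MM-DD) like those in the example.

```python
from datetime import date
from typing import Iterable


def latest_change(when_values: Iterable[str]) -> date:
    """Return the most recent date among the @when values.

    Assumption: every value is a plain ISO date such as "2014-08-22".
    """
    return max(date.fromisoformat(value) for value in when_values)


# The two <change> dates from the TAN-A-div file above.
version = latest_change(["2014-08-14", "2014-08-22"])
print(version.isoformat())
```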

The third major change populates the <body> with elements that calibrate the new version to the other two. <equate-works> says that, for the sake of this alignment, the works defined in the UK version and the German version are to be considered equivalent. We did not mention the US version because we do not need to. TAN rules specify that all alignments are transitive unless otherwise specified. If A and B are already defined to be the same work, and we equate A and C as the same work, then B and C will be equated as well. Note that we are not committing ourselves to the proposition that they are in reality the same work. We are making this statement only provisionally, to facilitate the alignment.
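The transitivity rule can be pictured as merging groups of identifiers. The Python sketch below uses a small union-find structure to do this; it is an illustration of the reasoning only, not TAN's own implementation, and the function name equivalence_classes is hypothetical.

```python
from typing import Dict, Iterable, List, Set


def equivalence_classes(groups: Iterable[Iterable[str]]) -> List[Set[str]]:
    """Merge groups of equated identifiers into transitive classes."""
    parent: Dict[str, str] = {}

    def find(item: str) -> str:
        # Follow parent links to the representative, shortening the path.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for group in groups:
        members = list(group)
        for member in members:
            find(member)  # register every member
        for other in members[1:]:
            parent[find(other)] = find(members[0])  # merge into one class

    classes: Dict[str, Set[str]] = {}
    for item in parent:
        classes.setdefault(find(item), set()).add(item)
    return sorted(classes.values(), key=lambda s: sorted(s))


# A and B already share a work (the two English versions);
# <equate-works> then equates A and C (UK and German).
already_same = ["eng-uk", "eng-us"]
equated = ["eng-uk", "ger"]
print(equivalence_classes([already_same, equated]))
```

All three sources end up in a single class, even though the US and German versions were never equated directly.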

<equate-div-types> declares that what the German version calls Zeile is, for the sake of this alignment, equivalent to what the UK version calls line. Transitivity means that Zeile is inferred to be equivalent to what the US version calls l. This element is completely optional. If we left it out, the alignment, which is based upon references, not division types, would not be affected. But by creating it, we assist users who may care about textual divisions.

A <realign> takes care of the apparent typographical error, this time anchoring the German version to the US one. Any <div-ref> in a <realign> is wrested from automatic alignment and attached to an <anchor-div-ref> and, by the law of transitivity, anything that aligns to it, in this case the UK version.

Note that we have used 5 and not e to point to the stray reference in the German version. We could have used e, or even the Roman numeral v, had we wished to, but we should settle on a single numbering system we are comfortable with for our TAN-A-div file and stick with it. Every TAN file's numeration system is evaluated locally, independent of any companion files. That way a single TAN file can use a single kind of numbering to access multiple TAN documents that may each use different numerals. Therefore we do not need to reconcile the letter labels a, b, and c in the @n values in the German version, because these will be automatically treated as equivalent to 1, 2, and 3. The TAN format allows four numeration systems other than Arabic numerals: Roman numerals (uppercase or lowercase), alphabetic numerals (a, b, c, ..., z, aa, bb, ...), digit-alphabet combinations (e.g., 1a, 1e, 4g), and alphabet-digit combinations (e.g., a4, a5, b5). The last two systems will be converted to hyphen-joined Arabic numerals before comparison (e.g., 1-1, 1-5, 4-7, 1-4, 1-5, 2-5).
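The conversions just described can be sketched in Python as follows. This is an illustration, not the TAN validation code. Two points are assumptions: alphabetic numerals beyond z are taken to be a single repeated letter (aa = 27, bb = 28), following the series given above, and the ambiguity between letters that are also Roman numerals (such as v or c) is resolved by a hypothetical letters parameter rather than by any rule stated here.

```python
import re

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(text: str) -> int:
    """Convert a Roman numeral (either case) to an integer."""
    text = text.lower()
    if not re.fullmatch(r"[ivxlcdm]+", text):
        raise ValueError(f"not a Roman numeral: {text!r}")
    values = [ROMAN[ch] for ch in text]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (iv = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def alpha_to_int(text: str) -> int:
    """a=1 ... z=26, aa=27, bb=28 (assumption: one letter, repeated)."""
    text = text.lower()
    if not re.fullmatch(r"[a-z]+", text) or len(set(text)) != 1:
        raise ValueError(f"not an alphabetic numeral: {text!r}")
    return (len(text) - 1) * 26 + (ord(text[0]) - ord("a") + 1)


def normalize(n: str, letters: str = "alpha") -> str:
    """Reduce an @n-style value to Arabic numerals, hyphen-joined if mixed.

    `letters` ("alpha" or "roman") says how a purely alphabetic value
    should be read; it is a hypothetical switch for this sketch.
    """
    n = n.strip().lower()
    if re.fullmatch(r"[0-9]+", n):
        return str(int(n))
    mixed = re.fullmatch(r"([0-9]+)([a-z]+)", n)
    if mixed:  # digit-alphabet, e.g. 4g
        return f"{int(mixed.group(1))}-{alpha_to_int(mixed.group(2))}"
    mixed = re.fullmatch(r"([a-z]+)([0-9]+)", n)
    if mixed:  # alphabet-digit, e.g. b5
        return f"{alpha_to_int(mixed.group(1))}-{int(mixed.group(2))}"
    return str(roman_to_int(n) if letters == "roman" else alpha_to_int(n))


for label in ["a", "b", "c", "e", "1a", "4g", "b5"]:
    print(label, "->", normalize(label))
print("v (roman) ->", normalize("v", letters="roman"))
```

Under these assumptions, e and the Roman numeral v both reduce to 5, which is why any of the three forms could have been used in the <realign>.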

With these changes, the new version is completely synchronized with the other two. Our work might have been simpler if we had just modified the German version ourselves. But such changes would have affected only our local copy, not the master one. Changing only our local copy would not allow us to connect our work to other TAN files that may depend upon the same master file.

But the format has also been designed to anticipate a living, growing network. Perhaps Hans Schmidt, the producer of the German version, can be contacted. We do so, and we suggest that he modify the version to make it align better. In the case of <div-type>, he need merely add another element: <IRI></IRI>. This line, in addition to the preexisting <IRI>, specifies that the two IRIs are equivalent. Perhaps he has reasons for labeling the lines with letters, and perhaps he is reluctant to explicitly identify this poem with Ring around the Rosie. That is within his rights. (Remember, TAN is meant to provide a framework within which opinions can be registered, even counterintuitive ones.) But the conversation might lead to our pointing out that n="e" should probably be n="d" and that there is an apparent discrepancy in the third line. (The original, printed book has the poem twice on page 438, once with the spelling "Holderbuch," and once with "Holderbusch.") If Schmidt chooses to correct his master file, he can add a new <change>, and thereby tacitly notify anyone else using the file that corrections have been made.

At this point we have a network of five TAN files, four in our corpus and one from outside. Although simple, the network could be the basis for some creative and complex research questions. Stylesheets could be used to automatically align the versions for reading and study, or to perform statistical analysis. Study of the rest of these guidelines, as well as example TAN libraries, will suggest numerous ways to create, manage, share, and use TAN files.