Building TAN Vocabulary


The first TAN-T transcription had a longer <head> than the second because the first used an explicit method, specifying every IRI and name, whereas the second adopted shortcuts that took advantage of TAN vocabulary. TAN vocabularies are not merely a convenience; they are intended to avoid problems that beset projects that create many files with repeated data patterns. When (not if) we change one file, we have to remember all the other places where the same change might be needed. The old programmer's adage "Don't repeat yourself" (DRY) applies here. If there is a repeating data pattern, put it in one master place, and let the other files point to that pattern. Then any change needs to be made in only one place.

The previous examples drew from standard TAN vocabulary, which is written in one of the other TAN formats, TAN-voc. There is a whole collection of standard TAN-voc files in the project subdirectory called vocabularies. We can write our own TAN-voc files, to collect the vocabulary items that we will use repeatedly from one file to the next. For example:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="../../schemas/TAN-voc.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="../../schemas/TAN-voc.sch" type="application/xml" 
    schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
    id="tag:parkj@textalign.net,2015:TAN-voc:standard">
    <head>
        <name>Keywords for TAN files edited by Jenny Park</name>
        <license licensor="park" which="by 4.0"/>
        <vocabulary-key>
            <person which="Jenny Park" xml:id="park"/>
        </vocabulary-key>
        <file-resp who="park"/>
        <resp roles="creator" who="park"/>
        <change when="2019-10-08" who="park">Started file</change>
        <to-do>
            <comment when="2020-01-04" who="park">Need to check files for new vocabulary items.</comment>
        </to-do>
    </head>
    <body>
        <group affects-element="person">
            <item>
                <IRI>tag:parkj@textalign.net,2015:self</IRI>
                <name xml:lang="eng">Jenny Park</name>
            </item>
        </group>
        <item affects-element="work">
            <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
            <name>Ring a Ring o' Roses</name>
            <name>Ring Around the Rosie</name>
        </item>
    </body>
</TAN-voc>

In this example, @id and <name> have been updated, and a <comment> has been added to <to-do>. The most significant difference is the <body>, which has two <item>s, one of which is wrapped in a <group>. Each @affects-element names one or more elements that the enclosed items affect, and the <item>s have the standard IRI + name pattern. <group>s may nest as deeply as you like.

The difference between a grouped and ungrouped <item> is purely a matter of taste and convenience. The example above illustrates both methods.

The <vocabulary-key> has a <person> whose @which points to the body of the first <item>. That is, a TAN-voc file can use its own vocabulary, without repeating it in <vocabulary-key>.
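To make the lookup concrete, here is a minimal Python sketch of how a processor might resolve a @which value against the <body> shown above. It is not the TAN validator's actual code. The function names collect_items and resolve are hypothetical, and the sketch assumes that an <item> without its own @affects-element inherits the value from its enclosing <group>, and that names are matched without regard to case.

```python
import xml.etree.ElementTree as ET

NS = "{tag:textalign.net,2015:ns}"

# A trimmed copy of the TAN-voc example above (head omitted).
TAN_VOC = """<TAN-voc xmlns="tag:textalign.net,2015:ns">
    <body>
        <group affects-element="person">
            <item>
                <IRI>tag:parkj@textalign.net,2015:self</IRI>
                <name xml:lang="eng">Jenny Park</name>
            </item>
        </group>
        <item affects-element="work">
            <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
            <name>Ring a Ring o' Roses</name>
            <name>Ring Around the Rosie</name>
        </item>
    </body>
</TAN-voc>"""


def collect_items(node, inherited=()):
    """Yield (affected element names, IRIs, names) for every <item> under node.

    Assumption: an <item> with no @affects-element of its own takes the
    value from the nearest enclosing <group>.
    """
    affects = tuple(node.get("affects-element", "").split()) or inherited
    if node.tag == NS + "item":
        iris = [e.text for e in node.findall(NS + "IRI")]
        names = [e.text for e in node.findall(NS + "name")]
        yield affects, iris, names
    else:
        for child in node:
            yield from collect_items(child, affects)


def resolve(root, element_name, which):
    """Return the IRIs of the first item whose name matches `which`, else None."""
    wanted = which.casefold()
    for affects, iris, names in collect_items(root.find(NS + "body")):
        if element_name in affects and any(n.casefold() == wanted for n in names):
            return iris
    return None


root = ET.fromstring(TAN_VOC)
print(resolve(root, "person", "Jenny Park"))
print(resolve(root, "work", "Ring around the Rosie"))
```

The second call uses the spelling "Ring around the Rosie", which differs in case from the <name> in the vocabulary file, to show why a case-insensitive comparison matters.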

Let's return to the <head>s of our two TAN-T files, and see how to incorporate our new TAN-voc vocabulary file.

<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
    id="tag:parkj@textalign.net,2015:ring01">
    <head>
        <name>TAN transcription of Ring a Ring o' Roses</name>
        <master-location 
            href="http://textalign.net/release/TAN-2020/examples/ring-o-roses.eng.1881.xml"/>
        <license which="by 4.0" licensor="park"/>
        <work which="Ring around the Rosie"/>
        <source>
            <IRI>http://lccn.loc.gov/12032709</IRI>
            <name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]</name>
        </source>
        <vocabulary>
           <IRI>tag:parkj@textalign.net,2015:TAN-voc:standard</IRI>
           <name>Vocabulary for TAN files edited by Jenny Park</name>
           <location href="TAN-voc/park-projects.TAN-voc.xml" accessed-when="2020-01-10"/>
        </vocabulary>
        <vocabulary-key>
            <person xml:id="park" which="Jenny Park"/>
            <div-type xml:id="line" which="line (verse)"/>
        </vocabulary-key>
        <file-resp who="park"/>
        <resp roles="creator" who="park"/>
        <change when="2014-08-13" who="park">Started file</change>
        <to-do/>
    </head>
    . . . . . . .
</TAN-T>

<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
    id="tag:parkj@textalign.net,2015:ring02">
    <head>
      <name>TAN transcription of Ring around the Rosie</name>
      <master-location>ring-o-roses.eng.1987.xml</master-location>
      <license which="by 4.0" licensor="park"/>
      <work which="Ring around the Rosie"/>
      <source>
         <IRI>http://lccn.loc.gov/87042504</IRI>
         <name>Mother Goose, from nursery to literature / by Gloria T. Delama, 1987.</name>
      </source>
      <vocabulary>
         <IRI>tag:parkj@textalign.net,2015:TAN-voc:standard</IRI>
         <name>Vocabulary for TAN files edited by Jenny Park</name>
         <location href="TAN-voc/park-projects.TAN-voc.xml" accessed-when="2020-01-10"/>
      </vocabulary>
      <adjustments>
         <normalization which="no hyphens"/>
      </adjustments>
      <vocabulary-key>
         <div-type xml:id="l" which="line (verse)"/>
         <person xml:id="park" which="Jenny Park"/>
      </vocabulary-key>
      <resp roles="creator" who="park"/>
      <change when="2014-10-24" who="park">Started file</change>
      <comment when="2014-10-24" who="park">See p. 39 of source.</comment>
      <to-do/>
   </head>
   . . . . . .
</TAN-T>

In each TAN-T file, a new <vocabulary> points to the project TAN-voc vocabulary file we have just created. Along with the customary IRI + name pattern is a new element, <location>, which specifies where the digital file was accessed and when (through @accessed-when). We may include as many of these <location> elements as we wish, with the most preferred or reliable one at the top. The validation process will consult only the first one that leads to an available document. The @accessed-when value is important, because the validator will look for changes in the file since we last accessed it, and if any changes are found a warning with a summary of the changes will be returned. It is then up to us to determine if the alterations merit any action on our part.
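The two behaviors just described, falling back through <location> elements and warning about changes made after @accessed-when, can be sketched in Python as follows. This is an illustration only. The function names are hypothetical, the availability check is passed in as a callback rather than performed over the network, and the sketch assumes that changes are detected by comparing each <change>'s @when against @accessed-when; the validator's real mechanism may differ. Only plain YYYY-MM-DD dates are handled.

```python
from datetime import date


def first_available(hrefs, is_available):
    """Return the first href for which is_available(href) is true, else None.

    `hrefs` is the list of <location href="..."> values in document order,
    so the most preferred location should come first.
    """
    for href in hrefs:
        if is_available(href):
            return href
    return None


def changes_since(change_log, accessed_when):
    """Return the (when, summary) pairs dated after `accessed_when`.

    `change_log` stands in for the <change when="..."> elements of the
    target file's <head>.
    """
    cutoff = date.fromisoformat(accessed_when)
    return [(when, text) for when, text in change_log
            if date.fromisoformat(when) > cutoff]


# Hypothetical data: the second change is invented for illustration.
log = [("2019-10-08", "Started file"), ("2020-02-01", "Added an item")]
for when, text in changes_since(log, "2020-01-10"):
    print(f"Warning: target changed on {when}: {text}")
```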

Similarly, anyone using or depending upon our file will be notified of any changes we make, through the same validation process.

Once the <vocabulary> is in place, we can draw from our predefined vocabulary. Hence, these revised versions of the <head>s are a bit more compact and easier to read. The longer the TAN file, the more noticeable the improvement. And when our library grows into dozens of files, we'll be grateful that a change that affects all the files needs to be made only once.

Now that we have created the metadata for our transcriptions, let's turn to the alignment files. Those <head>s will look slightly different, because they are not concerned with transcriptions per se. We start with the TAN-A file:

<TAN-A xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
    id="tag:parkj@textalign.net,2015:ring-alignment">
    <head>
       <name>div-based alignment of multiple versions of Ring o Roses</name>
       <master-location href="http://textalign.net/release/TAN-2020/examples/TAN-A/ringoroses.div.1.xml"/>
       <license which="by_4.0" licensor="park"/>
       <source xml:id="eng-uk">
          <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
          <name>Transcription of ring around the roses in English (UK)</name>
          <location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/>
       </source>
       <source xml:id="eng-us">
          <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
          <name>Transcription of ring around the roses in English (US)</name>
          <location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/>
       </source>
       <vocabulary-key>
          <person xml:id="park" which="Jenny Park"/>
       </vocabulary-key>
       <resp who="park" roles="creator"/>
       <change when="2014-08-14" who="park">Started file</change>
       <to-do>
          <comment when="2018-08-09-04:00" who="park">Finish file.</comment>
       </to-do>
    </head>
    . . . . . .
</TAN-A>

Much of the code above will look similar to the previous two examples. The file's <name> and <master-location> are updated. Like TAN-T files, TAN-A files have <source>s, except that those sources are always TAN-T transcription files, and they take the IRI + name + location pattern we saw above in <vocabulary>. Because alignment files take only TAN transcription files as sources, each <source>'s <IRI> always takes the @id value of the target TAN-T transcription file. <name> is arbitrary. It may replicate exactly the title found in the transcription file, or it may be modified, perhaps to harmonize better with the descriptions of the other source names. Our TAN-A file could have any number of <source>s, and not necessarily for the same work. The order in which we put the <source>s does not necessarily mean anything.
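The rule that a <source>'s <IRI> must equal the @id of the target TAN-T file is easy to check mechanically. The following Python sketch shows one way to do so; the function name source_matches_target is hypothetical, and the TAN-T string is a stripped-down stand-in for the first transcription file.

```python
import xml.etree.ElementTree as ET

# Stand-in for ring-o-roses.eng.1881.xml; only the root @id matters here.
TAN_T = """<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020"
    id="tag:parkj@textalign.net,2015:ring01">
    <head/>
    <body/>
</TAN-T>"""


def source_matches_target(source_iri, target_xml):
    """Return True if the <IRI> given in a <source> equals the target's root @id."""
    return ET.fromstring(target_xml).get("id") == source_iri


print(source_matches_target("tag:parkj@textalign.net,2015:ring01", TAN_T))
```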

This <head> explains why the <body> of our TAN-A file is allowed to be empty. We have already specified which sources are to be aligned and where they are to be found. Any user or processor of a TAN-A file may assume that every <div> in every source should be automatically aligned upon the basis of shared values of @n.
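The default alignment on shared @n values amounts to a join across sources. A minimal Python sketch, under simplifying assumptions: each source is reduced to a flat mapping from @n to text (real TAN-T files may nest <div>s), the function name align_by_n is hypothetical, and the line texts are placeholders rather than the actual transcriptions.

```python
def align_by_n(sources):
    """Group div texts from several sources by their shared @n value.

    `sources` maps a source id (the <source>'s @xml:id) to a dict of
    {n: text}. The result maps each n to {source id: text or None},
    with None marking a source that has no <div> with that @n.
    """
    all_ns = []
    for divs in sources.values():
        for n in divs:
            if n not in all_ns:
                all_ns.append(n)  # keep first-seen order
    return {n: {sid: divs.get(n) for sid, divs in sources.items()}
            for n in all_ns}


# Placeholder content, not the real transcriptions.
sources = {
    "eng-uk": {"1": "UK line 1", "2": "UK line 2"},
    "eng-us": {"1": "US line 1", "3": "US line 3"},
}
for n, row in align_by_n(sources).items():
    print(n, row)
```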

Meanwhile we turn to our fourth file, TAN-A-tok, whose <head> might look like this:

<TAN-A-tok xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:TAN-A-tok,ring01+ring02">
    <head>
        <name>token-based alignment of two versions of Ring o Roses</name>
        <master-location href="http://textalign.net/release/TAN-2020/examples/TAN-A-tok/ringoroses.01+02.token.1.xml"/>
        <license which="by-nc-nd_4.0" rights-holder="park"/>
        <token-definition src="eng-uk eng-us" which="letters"/>
        <source xml:id="eng-uk">
            <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
            <name>Transcription of ring around the roses in English (UK)</name>
            <location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/>
        </source>
        <source xml:id="eng-us">
            <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
            <name>Transcription of ring around the roses in English (US)</name>
            <location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/>
        </source>
        <vocabulary-key>
            <bitext-relation xml:id="B-descends-from-A" which="a/x+/b"/>
            <token-definition src="eng-uk eng-us" which="letters"/>
            <person xml:id="park" which="Jenny Park"/>
        </vocabulary-key>
        <change when="2015-01-20" who="park">Started file</change>
    </head>
    . . . . . .
</TAN-A-tok>

The TAN-A-tok <head> looks similar to the previous examples, except that <vocabulary-key> has some new content.

<bitext-relation> states through @which or an IRI + name pattern the stemmatic relationship we think holds between the two sources. We have used @which and the value a/x+/b, pointing to a standard TAN vocabulary item for bitext relations:

<TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
   id="tag:textalign.net,2015:tan-voc:bitext-relation">
. . . . . .
        <item>
            <IRI>tag:textalign.net,2015:bitext-relation:a/x+/b</IRI>
            <name>a/x+/b</name>
            <desc>direct descent, B descends from A, one or more mediaries</desc>
        </item>
. . . . . .
</TAN-voc>

<token-definition> specifies how we have defined our word tokens. @src has more than one value, specifying that the same tokenization rule should be applied to both sources. @which points to this standard TAN vocabulary item:

<TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
   id="tag:textalign.net,2015:tan-voc:tokenizations">
. . . . . .
        <item>
            <token-definition pattern="[\w&#xad;&#x200b;&#x200d;]+"/>
            <name>letters</name>
            <name>letters only</name>
            <name>general word characters only</name>
            <name>general ignore punctuation</name>
            <name>gwo</name>
            <desc>General tokenization pattern for any language, words only. Non-letters 
                such as punctuation are ignored.</desc>
        </item>
. . . . . .
</TAN-voc>

Up until now, all vocabulary items have taken the IRI + name pattern. The one above does not have an IRI, only a <token-definition> with a @pattern. The value of @pattern, which may look like gibberish, is a regular expression. "Regular" here does not mean ordinary; rather it derives from the Latin regula, rule. Regular expressions are rule-based patterned text searches. This particular pattern says that a token is defined as any contiguous string of word characters (\w), soft hyphens (&#xad;), zero-width spaces (&#x200b;), or zero-width joiners (&#x200d;). This is TAN's default tokenization pattern, and it will be assumed for any TAN-A-tok file that lacks a <token-definition>. TAN adopts this default because in ordinary conversation, when we refer to the nth word in a sentence, we most often ignore punctuation marks. For more on token definitions see the section called “Defining Words and Tokens” and the section called “TAN keywords for types of token definitions (<token-definition>)”. See also the section called “Regular Expressions”.
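To see what the default pattern does, here is a small Python approximation. It is only a sketch: TAN's @pattern uses the regular expression flavor of XML tools, whose \w is close to but not identical to Python's, so results may differ on edge cases. The function name tokenize is hypothetical.

```python
import re

# The default pattern: word characters, soft hyphen (U+00AD),
# zero-width space (U+200B), zero-width joiner (U+200D).
TOKEN_PATTERN = re.compile("[\\w\u00ad\u200b\u200d]+")


def tokenize(text):
    """Return every maximal run of characters matched by the pattern."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Ring-a-ring o' roses"))
# ['Ring', 'a', 'ring', 'o', 'roses']
```

Hyphens, apostrophes, and spaces all fall outside the pattern, so they split tokens and are themselves ignored. A soft hyphen inside a word, by contrast, does not split it.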

In our <vocabulary-key> we could have also included a <reuse-type>, but we have intentionally omitted it here, because we have <body bitext-relation="B-descends-from-A" reuse-type="general_adaptation">. The value for @reuse-type, general_adaptation, corresponds to a <name> in a standard TAN vocabulary item for reuse types. We don't need to invoke a <reuse-type> in the <vocabulary-key> because we have opted not to give it an @xml:id. Notice that general_adaptation has an underscore instead of a space. That's because <reuse-type> can take multiple values, which are signified by spaces. We could have used a hyphen instead of an underscore, if we preferred. The values of <name> are never case-sensitive, and the space, hyphen, and underscore are treated as equivalent.
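The matching rules in this paragraph can be sketched in Python as follows. The function names normalize_name and split_values are hypothetical, and the sketch assumes only what is stated above: case is ignored, and the space, hyphen, and underscore are interchangeable within a name, while spaces in an attribute separate multiple values.

```python
import re


def normalize_name(name):
    """Reduce a vocabulary name to a form suitable for comparison.

    Hyphens and underscores become spaces, and case is folded.
    """
    return re.sub(r"[ _-]", " ", name).casefold()


def split_values(attribute_value):
    """Split a multi-valued attribute on whitespace, then normalize each value."""
    return [normalize_name(v) for v in attribute_value.split()]


print(normalize_name("general_adaptation") == normalize_name("General adaptation"))
print(split_values("general_adaptation"))
```

This is why general_adaptation must be written with an underscore or hyphen inside @reuse-type: with a space, the attribute would be split into two separate values, general and adaptation.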
