Creating TAN Metadata (<head>)

Now that we have explored various IRI vocabularies for concepts around our versions of Ring-a-ring-a-roses, we can complete the metadata in our four TAN files. Let us start with the TAN-T file of the 1881 version:

    <head>
        <name>TAN transcription of Ring a Ring o' Roses</name>
        <master-location href="http://textalign.net/release/TAN-2018/examples/ring-o-roses.eng.1881.xml"/>
        <license>
            <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
            <name>Attribution 4.0 International</name>
        </license>
        <licensor who="park"/>
        <source>
            <IRI>http://lccn.loc.gov/12032709</IRI>
            <name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]</name>
        </source>
        <definitions>
            <work>
                <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
                <name>"Ring a Ring o' Roses" or "Ring Around the Rosie"</name>
            </work>
            <div-type xml:id="line">
                <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI>
                <name>line of poetry</name>
            </div-type>
            <person xml:id="park">
                <IRI>tag:parkj@textalign.net,2015:self</IRI>
                <name>Jenny Park</name>
            </person>
            <role xml:id="creator">
                <IRI>http://schema.org/creator</IRI>
                <name xml:lang="eng">creator</name>
            </role>
        </definitions>
        <resp roles="creator" who="park"/>
        <change when="2014-08-13" who="park">Started file</change>
    </head>

<name>, the human-readable counterpart to the @id that is inside the root element, can be anything. We can supply more than one <name>, in case we wish to provide it in different languages or variations.
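
As a minimal sketch of the point about multiple names, the fragment below gives the same file two <name>s. The second name is a hypothetical variation invented for illustration; @xml:lang is used on <name> the same way it appears elsewhere in these examples.

    <!-- both names label the same file; the second is an invented variation -->
    <name xml:lang="eng">TAN transcription of Ring a Ring o' Roses</name>
    <name xml:lang="eng">Ring a Ring o' Roses, 1881 version (TAN transcription)</name>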

<master-location> is mandatory only if we have claimed through @in-progress that the file is no longer in progress. One or more of these elements provide URLs where master versions of the file are kept (and updated). We provide this as a courtesy to others who might be using our data. Anyone who validates a local copy of the file will be warned if it does not match the master version, and will be told the most recent changes. This allows users to find out if changes have been made, and it allows us to make corrections and silently notify other users of our alterations. To communicate this, we do not have to keep track of who is using the file.
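
Because one or more <master-location>s are allowed, a file kept in two places might be declared as sketched below. The first URL is the one used in our example; the second is a hypothetical mirror, assumed only for illustration.

    <master-location href="http://textalign.net/release/TAN-2018/examples/ring-o-roses.eng.1881.xml"/>
    <!-- hypothetical mirror; not a real address -->
    <master-location href="http://example.org/mirror/ring-o-roses.eng.1881.xml"/>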

<license> specifies the license under which we are releasing our data. This element has nothing to do with the copyright of the source we have used (although, having been published in 1881, the book is clearly in the public domain). That is, we are declaring the rights attached to the data, not its source. This once again gets to the TAN metadata principle of describing our data and not other things. We can, if we want, describe the license of the source we have used (see the rest of the guidelines for guidance), but we absolutely must declare whether we have placed additional strictures on the dataset we have created. In this example, we have released the data under a Creative Commons license. The child element <IRI> specifies the IRI assigned by Creative Commons, and <name> is the human-readable form.

<licensor>, by means of @who, indicates who holds the license. In this case it points to a person, Jenny Park, who is identified under <definitions>.

The conjunction of <IRI> and <name>, the IRI + name pattern, is a recurrent feature of TAN files. We may include any number of <IRI> or <name> elements in an IRI + name pattern. But if we do so, we are stating that they all name the same thing, not different things.
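
A sketch of an IRI + name pattern with more than one <IRI> and more than one <name> follows. The second <IRI> is a hypothetical tag URI made up for this illustration; by listing it we would be asserting that it names exactly the same work as the DBpedia IRI.

    <work>
        <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
        <!-- hypothetical second identifier, claimed to name the same work -->
        <IRI>tag:parkj@textalign.net,2015:work:ring-o-roses</IRI>
        <name>Ring a Ring o' Roses</name>
        <name>Ring Around the Rosie</name>
    </work>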

<source> points, through its IRI + name pattern, to a computer- and human-readable description of the book we have chosen.

<definitions> contains data that is specific to TAN file types, to define our terminology.

<work> uses the IRI + name pattern to name the work we have chosen to transcribe. <div-type> specifies the type of divisions we have chosen to use to segment the transcription. In a more complex text, there would be several <div-type>s. Each one has an @xml:id, which takes as a value some nickname that we wish to use for @type values of <div>s.
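
For a more complex text, the several <div-type>s and the <div>s that use them might look like the sketch below. The stanza type, its DBpedia IRI, and the placeholder text are assumptions added for illustration; our nursery rhyme needs only the line type.

    <!-- in <definitions> -->
    <div-type xml:id="stanza">
        <!-- assumed IRI for the concept of a stanza -->
        <IRI>http://dbpedia.org/resource/Stanza</IRI>
        <name>stanza</name>
    </div-type>
    <div-type xml:id="line">
        <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI>
        <name>line of poetry</name>
    </div-type>

    <!-- in <body>: each @type value repeats a div-type/@xml:id nickname -->
    <div type="stanza" n="1">
        <div type="line" n="1">(text of the first line)</div>
        <div type="line" n="2">(text of the second line)</div>
    </div>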

The IRI + name pattern is also used for <person>, which describes who was involved in creating the data, and <role>. We may have as many <person>s and <role>s as we wish. In this case, Jenny Park has been given a tag URI. The <IRI> value of <role> comes from the vocabulary of schema.org, which is maintained by Bing, Google, and Yahoo! in conjunction with the W3C (the nonprofit organization dedicated to universal Internet standards), but we could have used Dublin Core or some other IRI vocabulary describing behaviors, responsibilities, and roles.

Those roles and persons get combined after the <definitions>, in a <resp>, which stipulates who was responsible for what roles.

Note

If you decide to modify someone else's TAN file, then you become responsible for changes, not the original person or organization. Your first point of order should be to add a <person> to the head, identifying yourself. You need not change the document's @id, but you should take responsibility for any changes you make; otherwise you are incorrectly attributing your changes to someone else.
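
A minimal sketch of what that might look like in the head of a modified file is given below. The second person, the tag URI, the date, and the change message are all hypothetical, invented for illustration.

    <definitions>
        <!-- the original <person xml:id="park"> stays untouched -->
        <person xml:id="editor2">
            <!-- hypothetical tag URI for the new contributor -->
            <IRI>tag:editor@example.org,2018:self</IRI>
            <name>A. N. Editor</name>
        </person>
    </definitions>
    <resp roles="creator" who="park"/>
    <change when="2014-08-13" who="park">Started file</change>
    <!-- the new change is credited to the new person, not to park -->
    <change when="2018-01-15" who="editor2">Corrected a typo in the second line</change>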

Remember that <head> is focused on the data, not its sources, so the claim that Jenny Park is the creator pertains only to the data. No inference should be made about who created the source. If someone wants that information, or anything else about the source, they should pursue the identifier we have provided under <source>.

<change> has attributes @when and @who that specify who made the change/comment and when. The value of @when is always a date plus optional time formatted according to the standard YYYY-MM-DD + time (optional). @who always carries a value that refers to an agent/@xml:id. Neither <change> nor <comment> takes <IRI> or any other children.
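
The two forms of @when might be written as sketched below. The time portion and the second message are made up for illustration, and the exact time syntax is assumed here to follow the usual ISO 8601 / XML Schema dateTime form (a T separator and an optional time zone offset).

    <!-- date only -->
    <change when="2014-08-13" who="park">Started file</change>
    <!-- date plus time; the time and the message are hypothetical -->
    <change when="2014-08-13T14:30:00-05:00" who="park">Checked transcription against source</change>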

So now we have finished one transcription file's metadata. The other one will look similar, but we'll also take a couple of shortcuts:

    <head>
      <name>TAN transcription of Ring around the Rosie</name>
      <master-location>ring-o-roses.eng.1987.xml</master-location>
      <license>
         <IRI>http://creativecommons.org/licenses/by/4.0/deed.en_US</IRI>
         <name>Creative Commons Attribution 4.0 International License</name>
         <desc>This data file is licensed under a Creative Commons Attribution 4.0 International
            License. The license is granted independent of rights and licenses associated with the
            source. </desc>
      </license>
      <licensor who="park"/>
      <source>
         <IRI>http://lccn.loc.gov/87042504</IRI>
         <name>Mother Goose, from nursery to literature / by Gloria T. Delama, 1987.</name>
      </source>
      <definitions>
         <work>
            <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
            <name>Ring around the Rosie</name>
         </work>
         <div-type xml:id="l" which="line (verse)"/>
         <person xml:id="park" roles="creator">
            <IRI>tag:parkj@textalign.net,2015:self</IRI>
            <name xml:lang="eng">Jenny Park</name>
         </person>
         <role xml:id="creator" which="creator"/>
      </definitions>
      <alter>
         <normalization which="no hyphens"/>
      </alter>
      <resp roles="creator" who="park"/>
      <change when="2014-10-24" who="park">Started file</change>
      <comment when="2014-10-24" who="park">See p. 39 of source.</comment>
   </head>

One significant difference is that three of the elements that normally take the IRI + name pattern have been replaced with a simpler form that takes merely @which and @xml:id. For a number of elements, TAN has predefined vocabulary that can be invoked by calling it (through @which) and giving it an abbreviation to be used elsewhere in the document (@xml:id).

After <definitions> comes a new element, <alter>, which contains a <normalization> statement that declares, through the name and the IRI in the underlying TAN definition, that we have opted to remove word-break line-end hyphenation. This provides a cautionary note to users of our data who might value line-end hyphenation. Any number of <normalization>s can be used to describe any alterations we might have made in our transcription. In other transcriptions we could use this feature to declare other suppressions, such as editorial comments or footnote signals.

Note that the value of div-type/@xml:id here, the letter l, differs from our previous transcription file, line. Even though we have adopted a different nickname, they are treated as equivalent because in each file we have defined l or line with the same IRI, http://dbpedia.org/resource/Line_(poetry). A computer that later looks for files with lines of poetry will not care about l and line, but will look at the underlying IRI that defines these terms. This exemplifies how linked data (see above) can support our work. We are free to use abbreviations and terms that make sense to us, yet we tie those abbreviations to IRIs that have valence outside our project.
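
Put side by side, the two declarations look like this. The sketch simply repeats the two <div-type>s from our transcription files; that the keyword line (verse) resolves to the DBpedia IRI is taken from the statement above, not checked independently against the TAN vocabulary.

    <!-- 1881 file: full IRI + name pattern, nickname "line" -->
    <div-type xml:id="line">
        <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI>
        <name>line of poetry</name>
    </div-type>

    <!-- 1987 file: predefined TAN vocabulary, nickname "l" -->
    <div-type xml:id="l" which="line (verse)"/>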

Now that we have created the metadata for our transcriptions, we turn to the alignment files. Those <head>s will look slightly different. We start with the TAN-A-div file:

    <head>
       <name>div-based alignment of multiple versions of Ring o Roses</name>
       <master-location>ringoroses.div.1.xml</master-location>
       <license which="by_4.0"/>
       <licensor who="park"/>
       <source xml:id="eng-uk">
          <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
          <name>Transcription of ring around the roses in English (UK)</name>
          <location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml</location>
       </source>
       <source xml:id="eng-us">
          <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
          <name>Transcription of ring around the roses in English (US)</name>
          <location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml</location>
       </source>
       <definitions>
           <person xml:id="park">
               <IRI>tag:parkj@textalign.net,2015:self</IRI>
               <name xml:lang="eng">Jenny Park</name>
           </person>
           <role xml:id="creator">
               <IRI>http://schema.org/creator</IRI>
               <name xml:lang="eng">creator</name>
           </role>
       </definitions>
       <resp who="park" roles="creator"/>
       <change when="2014-08-14" who="park">Started file</change>
    </head>

Much of the code above will look similar to the previous two examples. Every alignment file has only one kind of source, namely TAN transcription files, nothing else. Therefore <source>'s <IRI> always takes the @id value of the corresponding TAN transcription file. <name> is arbitrary. It may replicate exactly the title found in the transcription file, or it may be modified, perhaps to harmonize better with the descriptions of the other texts aligned in the file. <source> also has a child element not seen in the earlier two examples, <location>, which specifies where the digital file was accessed and when (through @when-accessed). We may include as many of these <location> elements as we wish, with the most preferred or reliable location at the top, since the validation process will use the first document that is available. The @when-accessed value is important, because the validator will look for changes in the file, and if there have been changes since we last accessed the file, it will return a warning with a summary of the number and kind of changes. If such a report is returned, it is up to us to determine if the alterations merit any action on our part.
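
A <source> with two <location>s, most preferred first, could be sketched as follows. The second location reuses the master location given in the 1881 transcription file; pairing it with the same @when-accessed date is an assumption made for illustration.

    <source xml:id="eng-uk">
        <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
        <name>Transcription of ring around the roses in English (UK)</name>
        <!-- tried first: the local copy -->
        <location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml</location>
        <!-- fallback if the local copy is unavailable -->
        <location when-accessed="2015-03-10">http://textalign.net/release/TAN-2018/examples/ring-o-roses.eng.1881.xml</location>
    </source>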

Our TAN-A-div file could have any number of <source>s, and not necessarily for the same work. It also does not matter in which order we put the <source>s. <definitions> holds only the <person> and <role> needed by <resp>, mainly because we have, in this case, no working assumptions to declare. In more advanced uses, this element would contain more.

This <head> explains why the <body> of our TAN-A-div file is allowed to be empty. We have already specified which sources are to be aligned and where they are to be found. All TAN-A-div files assume, by default, that every source that is a version of the same work should be aligned upon the basis of the @n value of <div>s. That is, any user or processor of a TAN-A-div file may assume that all implicit alignments should be made unless otherwise specified.
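
The default alignment can be pictured with the sketch below, which shows how <div>s in the two transcription files would pair up on @n. The placeholder text stands in for the actual lines, and the number of lines shown is arbitrary.

    <!-- ring-o-roses.eng.1881.xml (div-type nickname "line") -->
    <div type="line" n="1">(first line, 1881 version)</div>
    <div type="line" n="2">(second line, 1881 version)</div>

    <!-- ring-o-roses.eng.1987.xml (div-type nickname "l") -->
    <div type="l" n="1">(first line, 1987 version)</div>
    <div type="l" n="2">(second line, 1987 version)</div>

    <!-- implicit alignment: n="1" with n="1", n="2" with n="2" -->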

For transcriptions that are already similarly structured and labeled, a TAN-A-div file is unnecessary for alignment. But we will see that the options available in a TAN-A-div's <definitions> and <body> will allow us not only to deal with inconsistencies in source transcriptions but to make important claims, e.g., where one work quotes from another.

Meanwhile we turn to our fourth file, TAN-A-tok, whose <head> looks like this:

    <head>
        <name>token-based alignment of two versions of Ring o Roses</name>
        <master-location>ringoroses.01+02.token.1.xml</master-location>
        <license which="by-nc-nd_4.0" rights-holder="park"/>
        <source xml:id="ring1881">
            <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
            <name>Ring o roses 1881</name>
            <location when-accessed="2015-01-17">../TAN-T/ring-o-roses.eng.1881.xml</location>
        </source>
        <source xml:id="ring1987">
            <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
            <name>Ring o roses 1987</name>
            <location when-accessed="2015-01-17">../TAN-T/ring-o-roses.eng.1987.xml</location>
        </source>
        <definitions>
            <bitext-relation xml:id="B-descends-from-A">
                <IRI>tag:textalign.net,2015:bitext-relation:a/x+/b</IRI>
                <name>B descends directly from A, unknown number of intermediaries</name>
                <desc>The 1987 version is hypothesized to descend somehow from the
                    1881 version, mainly for the sake of illustration.</desc>
            </bitext-relation>
            <reuse-type xml:id="adaptationGeneral">
                <IRI>tag:textalign.net,2015:reuse-type:adaptation:general</IRI>
                <name>general adaptation</name>
            </reuse-type>
            <token-definition src="ring1881 ring1987" which="letters"/>
            <person xml:id="park" roles="creator">
                <IRI>tag:parkj@textalign.net,2015:self</IRI>
                <name xml:lang="eng">Jenny Park</name>
            </person>
            <role xml:id="creator" which="creator"/>
        </definitions>
        <change when="2015-01-20" who="park">Started file</change>
    </head>

The TAN-A-tok <head> looks similar to the previous examples, except that <definitions> has some new content.

<bitext-relation> states through an IRI + name pattern the stemmatic relationship we think holds between the two sources. (Stemmatics is the study of the chain of transmission—the relationship of an original text-bearing object to the ones that survive. It frequently involves the creation of genealogical-like trees to illustrate the work's version history.) We have used the entire IRI + name pattern, but we could have substituted it with @which and the value a/x+/b.
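
The shortcut form mentioned above might look like the sketch below. The @which value a/x+/b is the one named in the text; keeping the same @xml:id is our choice, and whether <desc> may accompany the shortcut form is not addressed here.

    <bitext-relation xml:id="B-descends-from-A" which="a/x+/b"/>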

One or more <reuse-type>s specify how one text has reused another. The IRI we have used shows that we believe that the later text has generally adapted the earlier one. If this were a translation or a quotation or some other kind of text reuse, we might have used a different IRI.

A third declaration, <token-definition>, specifies how we have defined our word tokens. @src has more than one value, specifying that the same tokenization rule should be applied to both sources. This element is optional. If we leave it out, users are to assume that we mean letters. This is because most often, whenever in ordinary conversation we refer to the nth word in a sentence we assume people will skip punctuation marks when they count.

The value for @which, letters, is a reserved TAN keyword that specifies that a token is any consecutive string of word characters, ignoring spaces and punctuation. Under this token definition the phrase "Hush!" said he would have three tokens. Had we set the value of @which to the reserved TAN keyword letters and punctuation, we would have six tokens, since each punctuation mark would be defined as a token.
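
Worked out token by token, with a vertical bar added only to mark the boundaries between tokens, the two counts come out as follows:

    "Hush!" said he

    which="letters"                  Hush | said | he                (3 tokens)
    which="letters and punctuation"  " | Hush | ! | " | said | he    (6 tokens)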