Creating TAN Metadata (<head>)

Now that we have explored various IRI vocabularies for concepts around our versions of Ring-a-ring-a-roses, we can complete the metadata in our four TAN files. Let us start with the TAN-T file of the 1881 version:

    <head>
        <name>TAN transcription of Ring a Ring o' Roses</name>
        <master-location>ring-o-roses.eng.1881.xml</master-location>
        <rights-excluding-sources rights-holder="park">
            <IRI>http://creativecommons.org/licenses/by/4.0/deed.en_US</IRI>
            <name>This data file is licensed under a Creative Commons Attribution 4.0 International
                License. The license is granted independent of any rights and licenses that may be 
                associated with the source. </name>
        </rights-excluding-sources>
        <source>
            <IRI>http://lccn.loc.gov/12032709</IRI>
            <name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]</name>
        </source>
        <declarations>
            <work>
                <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
                <name>"Ring a Ring o' Roses" or "Ring Around the Rosie"</name>
            </work>
            <div-type xml:id="line">
                <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI>
                <name>line of poetry</name>
            </div-type>
        </declarations>
        <agent xml:id="park" roles="creator">
            <IRI>tag:parkj@textalign.net,2015:self</IRI>
            <name>Jenny Park</name>
        </agent>
        <role xml:id="creator">
            <IRI>http://schema.org/creator</IRI>
            <name xml:lang="eng">creator</name>
        </role>
        <change when="2014-08-13" who="park">Started file</change>
    </head>

<name> is the human-readable form of the @id that is inside the root element, <TAN-T>. It can be anything, and we can supply more than one <name>, in case we wish to provide it in different languages or variations.

<master-location> is mandatory only if we have claimed through @in-progress that the file is no longer in progress. One or more of these elements provide URLs where master versions of the file are kept (and updated). Each may be an absolute URL, such as an address on the Internet, or a relative URL, in case we are working exclusively on our local computer. We provide this as a courtesy to others who might be using our data. If someone downloads a copy and starts working with it, then whenever they validate the file, if it does not match the master version, a warning is returned, along with a message or the location of the elements that were last changed. This allows users to find out whether changes have been made, and it allows us to make corrections and silently notify other users of our alterations. To communicate this, we do not have to keep track of who is using the file.

<rights-excluding-sources> contains information about rights to the data we are releasing. This element has nothing to do with the copyright of the source we have used (although, having been published in 1881, the book is clearly in the public domain). This once again gets to the TAN metadata principle of describing our data and not other things. We have the option to describe the license of the source we have used (see the rest of the guidelines for guidance), but we absolutely must declare whether we have placed additional strictures on the dataset we have created. That is, we are declaring the rights attached to the data, not its source. In this example, we have released the data under a Creative Commons license. The child element <IRI> specifies the IRI assigned by Creative Commons, and <name> describes it in human-readable format.

The conjunction of <IRI> and <name>, the IRI + name pattern, is a recurrent feature of TAN files. We may include any number of <IRI> or <name> elements in an IRI + name pattern. But if we do so, we are stating that they all name the same thing, not different things.

<source> points, through its IRI + name pattern, to a computer- and human-readable description of the book we have chosen.

<declarations> contains data that is specific to TAN file types, to declare the assumptions we have made relevant to the kind of data we have created. In this case, because we are working with transcriptions, we have two major components: <work> and <div-type>.

<work> uses the IRI + name pattern to name the work we have chosen to transcribe. <div-type> specifies the type of divisions we have chosen to use to segment the transcription. In a more complex text, there would be several <div-type>s. Each one has an @xml:id, which takes as a value some nickname that we wish to use for @type values of <div>s.

The IRI + name pattern is also used for <agent>, which describes who was involved in creating the data, and <role>. We may have as many <agent>s and <role>s as we wish. The agent in this case, Jenny Park, has been given a tag URI. The <IRI> value of <role> comes from the vocabulary of schema.org, which is maintained by Bing, Google, and Yahoo! in conjunction with the W3C (the nonprofit organization dedicated to universal Internet standards), but we could have used Dublin Core or some other IRI vocabulary describing behaviors, responsibilities, and roles.

Note

If you decide to modify someone else's TAN file, then you become responsible for the changes, not the original person or organization. Your first order of business should be to add an <agent> to the head, identifying yourself. You need not change the document's @id, but you should take responsibility for any changes you make, probably using <change> or an @ed-who and an @ed-when. Otherwise you are incorrectly attributing your changes to someone else.

Remember that <head> is focused on the data, not its sources, so the claim that Jenny Park is the creator pertains only to the data. No inference should be made about who created the source. If someone wants that information, or anything else about the source, they should pursue the identifier we have provided under <source>.

<change> has attributes @when and @who that specify who made the change/comment and when. The value of @when is always a date, optionally followed by a time, formatted according to the standard YYYY-MM-DD + time (optional). @who always carries a value that refers to an agent/@xml:id. Neither <change> nor <comment> (absent here) takes an IRI, mainly because the likelihood that the data would ever be reused, repeated, or linked to is altogether too remote to make a mandated <IRI> useful.
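The two rules just described for <change> can be checked mechanically. The following is a minimal Python sketch, not part of any TAN tooling: the function name check_changes, the trimmed namespace-free copy of the <head>, and the date pattern are all assumptions made for illustration, and the pattern is looser than whatever the official TAN schemas enforce.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Trimmed, namespace-free copy of the <head> above (an assumption made for brevity).
HEAD = """
<head>
    <agent xml:id="park" roles="creator"><name>Jenny Park</name></agent>
    <role xml:id="creator"><name>creator</name></role>
    <change when="2014-08-13" who="park">Started file</change>
</head>
"""

# YYYY-MM-DD, optionally followed by a time. A loose pattern, not the official TAN rule.
WHEN = re.compile(
    r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2})?)?$"
)


def check_changes(head_xml):
    """Return a list of problems found in <change> and <comment> elements."""
    head = ET.fromstring(head_xml)
    agent_ids = {agent.get(XML_ID) for agent in head.iter("agent")}
    problems = []
    for el in list(head.iter("change")) + list(head.iter("comment")):
        when, who = el.get("when", ""), el.get("who", "")
        if not WHEN.match(when):
            problems.append(f"bad @when: {when!r}")
        # @who may in principle name several agents, separated by spaces.
        for ref in who.split():
            if ref not in agent_ids:
                problems.append(f"@who points to unknown agent: {ref!r}")
    return problems


print(check_changes(HEAD))
```

An empty list means that every @who resolves to a declared <agent> and every @when looks like a date.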

So now we have finished one transcription file's metadata. The other one will look similar, but we'll also take a couple of nice shortcuts:

    <head>
      <name>TAN transcription of Ring around the Rosie</name>
      <master-location>ring-o-roses.eng.1987.xml</master-location>
      <rights-excluding-sources which="by-nc-nd_2.0" rights-holder="park"/>
      <source>
         <IRI>http://lccn.loc.gov/87042504</IRI>
         <name>Mother Goose, from nursery to literature / by Gloria T. Delama, 1987.</name>
      </source>
      <declarations>
         <work>
            <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
            <name>Ring around the Rosie</name>
         </work>
         <div-type xml:id="l" which="half-line (verse)"/>
         <filter>
            <normalization which="no hyphens"/>
         </filter>
      </declarations>
      <agent xml:id="park" roles="creator">
         <IRI>tag:parkj@textalign.net,2015:self</IRI>
         <name xml:lang="eng">Jenny Park</name>
      </agent>
      <role xml:id="creator" which="creator"/>
      <change when="2014-10-24" who="park">Started file</change>
      <comment when="2014-10-24" who="park">See p. 39 of source.</comment>
    </head>

One significant difference is that three of the elements that normally take the IRI + name pattern have been replaced with a simpler form that takes merely @which and @xml:id. That is because TAN has a predefined vocabulary whose items can be invoked by name (through @which) and given an abbreviation to be used elsewhere in the document (@xml:id).

<declarations> has a new child, <filter>, which contains a <normalization> statement that declares, through the name and the IRI in the underlying TAN definition, that we have opted to remove word-break line-end hyphenation. This provides a cautionary note to users of our data who might value line-end hyphenation. Any number of <normalization>s can be used to describe any alterations we might have made in our transcription. In other transcriptions we could use this feature to declare other suppressions, such as editorial comments or footnote signals.

Note that the value of div-type/@xml:id here, the letter l, differs from the one in our previous transcription file, line. Even though we have adopted a different nickname, the two are treated as equivalent because in each file we have defined l or line with the same IRI, http://dbpedia.org/resource/Line_(poetry). A computer that later looks for files with lines of poetry will not care about l and line, but will look at the underlying IRI that defines these terms. This exemplifies how linked data (see above) can support our work. We are free to use abbreviations and terms that make sense to us, yet we can also tie those abbreviations into the larger infrastructure by means of IRIs. It also means that we can tether our texts to others on the basis of segments that may be generally rare and unfamiliar, or common but only within a specific field (e.g., sections of a legal document).
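To make the point about nicknames and IRIs concrete, here is a small Python sketch of how a processor might decide that l and line mean the same thing. It is only an illustration under stated assumptions: the fragments are namespace-free, the function names are hypothetical, and the IRI that the text says underlies the @which shortcut is written out explicitly in the second fragment rather than looked up in TAN's vocabulary.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
LINE_IRI = "http://dbpedia.org/resource/Line_(poetry)"

# Hypothetical, namespace-free fragments. The 1987 file really uses @which;
# here the IRI that @which is said to stand for is spelled out instead.
FILE_1881 = (
    f'<declarations><div-type xml:id="line"><IRI>{LINE_IRI}</IRI></div-type></declarations>'
)
FILE_1987 = (
    f'<declarations><div-type xml:id="l"><IRI>{LINE_IRI}</IRI></div-type></declarations>'
)


def div_type_iris(declarations_xml):
    """Map each div-type nickname (@xml:id) to the set of IRIs that define it."""
    root = ET.fromstring(declarations_xml)
    return {
        dt.get(XML_ID): {iri.text.strip() for iri in dt.findall("IRI")}
        for dt in root.iter("div-type")
    }


def equivalent_nicknames(a, b):
    """Pair nicknames from two files whose definitions share at least one IRI."""
    return [(na, nb) for na, ia in a.items() for nb, ib in b.items() if ia & ib]


pairs = equivalent_nicknames(div_type_iris(FILE_1881), div_type_iris(FILE_1987))
print(pairs)  # [('line', 'l')]
```

The comparison never looks at the nicknames themselves, only at the IRIs behind them, which is why each file remains free to choose its own abbreviations.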

Now that we have created the metadata for our transcriptions, we turn to the alignment files. Those <head>s will look slightly different. We start with the TAN-A-div file:

    <head>
       <name>div-based alignment of multiple versions of Ring o Roses</name>
       <master-location>ringoroses.div.1.xml</master-location>
       <rights-excluding-sources which="by-nc-nd_4.0" rights-holder="park"/>
       <source xml:id="eng-uk">
          <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
          <name>Transcription of ring around the roses in English (UK)</name>
          <location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml</location>
       </source>
       <source xml:id="eng-us">
          <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
          <name>Transcription of ring around the roses in English (US)</name>
          <location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml</location>
       </source>
       <declarations/>
       <agent xml:id="park" roles="creator">
          <IRI>tag:parkj@textalign.net,2015:self</IRI>
          <name xml:lang="eng">Jenny Park</name>
       </agent>
       <role xml:id="creator" which="creator"/>
       <change when="2014-08-14" who="park">Started file</change>
    </head>

Much of the code above will look similar to the previous two examples. Every alignment file has only one kind of source, namely TAN transcription files, nothing else. Therefore <source>'s <IRI> always takes the @id value of the corresponding TAN transcription file. <name> is arbitrary. It may replicate exactly the title found in the transcription file, or it may be modified, perhaps to harmonize better with the descriptions of the other texts aligned in the file. <source> also has a child element not seen in the earlier two examples, <location>, which specifies where the digital file was accessed and when (through @when-accessed). We may include as many of these <location> elements as we wish, with the most preferred or reliable location at the top, since the validation process will use the first document that is available. The @when-accessed value is important, because the validator will look for changes in the file, and if there have been changes since we last accessed the file, it will return a warning with a summary of the number and kind of changes. If such a report is returned, it is up to us to determine if the alterations merit any action on our part.
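The rule that the first available <location> wins can be sketched in a few lines of Python. This is an illustration only, not the behavior of the actual TAN validator: the function name first_available is hypothetical, the <source> fragment is namespace-free, and only relative paths on the local file system are handled (a real processor would presumably also fetch absolute URLs).

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Namespace-free copy of one <source> from the example above (an assumption).
SOURCE = """
<source xml:id="eng-uk">
    <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
    <name>Transcription of ring around the roses in English (UK)</name>
    <location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml</location>
</source>
"""


def first_available(source_xml, base_dir):
    """Return (path, when_accessed) for the first <location> that exists locally.

    Locations are tried in document order, so the preferred one should come first.
    Returns None when no listed location can be found.
    """
    source = ET.fromstring(source_xml)
    for loc in source.findall("location"):
        candidate = Path(base_dir) / loc.text.strip()
        if candidate.is_file():
            return candidate, loc.get("when-accessed")
    return None


# The result depends on which files are present relative to the current directory.
print(first_available(SOURCE, "."))
```

The @when-accessed value is returned alongside the path so that a later step could compare it with the dates of the <change> elements in the file that was found.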

Our TAN-A-div file could have any number of <source>s, and not necessarily for the same work. It also does not matter in which order we put the <source>s. <declarations> is empty, mainly because we have, in this case, no working assumptions to declare. In more advanced uses, this element would not be empty.

This <head> explains why the <body> of our TAN-A-div file is allowed to be empty. We have already specified which sources are to be aligned and where they are to be found. All TAN-A-div files assume, by default, that every source that is a version of the same work should be aligned upon the basis of the @n value of <div>s. That is, any user or processor of a TAN-A-div file may assume that all implicit alignments should be made unless otherwise specified.
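The default alignment described here amounts to grouping <div>s from every source by their @n value. The Python sketch below shows the idea under simplifying assumptions: the bodies are hypothetical and namespace-free, their text is placeholder wording rather than the actual transcriptions, the function name align_by_n is invented for illustration, and only a flat list of <div>s is handled (nested divisions would need hierarchical references).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical bodies. The text of each line is a placeholder, not the source text.
SOURCES = {
    "eng-uk": (
        '<body><div type="line" n="1">first line, 1881</div>'
        '<div type="line" n="2">second line, 1881</div></body>'
    ),
    "eng-us": (
        '<body><div type="l" n="1">first line, 1987</div>'
        '<div type="l" n="2">second line, 1987</div></body>'
    ),
}


def align_by_n(sources):
    """Group the text of each source's <div>s under their shared @n value."""
    aligned = defaultdict(dict)
    for src_id, body_xml in sources.items():
        for div in ET.fromstring(body_xml).iter("div"):
            aligned[div.get("n")][src_id] = (div.text or "").strip()
    return dict(aligned)


for n, versions in align_by_n(SOURCES).items():
    print(n, versions)
```

Because the grouping key is @n alone, the differing @type nicknames (line and l) play no part in the alignment itself.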

For transcriptions that are already similarly structured and labeled, a TAN-A-div file is unnecessary for alignment. But we will see that the options available in a TAN-A-div's <declarations> and <body> will allow us not only to deal with inconsistencies in source transcriptions but also to make important statements, such as indicating where one work quotes from another.

Meanwhile we turn to our fourth file, TAN-A-tok, whose <head> looks like this:

    <head>
        <name>token-based alignment of two versions of Ring o Roses</name>
        <master-location>ringoroses.01+02.token.1.xml</master-location>
        <rights-excluding-sources which="by-nc-nd_4.0" rights-holder="park"/>
        <source xml:id="ring1881">
            <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
            <name>Ring o roses 1881</name>
            <location when-accessed="2015-01-17">../TAN-T/ring-o-roses.eng.1881.xml</location>
        </source>
        <source xml:id="ring1987">
            <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
            <name>Ring o roses 1987</name>
            <location when-accessed="2015-01-17">../TAN-T/ring-o-roses.eng.1987.xml</location>
        </source>
        <declarations>
            <bitext-relation xml:id="B-descends-from-A">
                <IRI>tag:textalign.net,2015:bitext-relation:a/x+/b</IRI>
                <name>B descends directly from A, unknown number of intermediaries</name>
                <desc>The 1987 version is hypothesized to descend somehow from the 
                    1881 version, mainly for the sake of illustration.</desc>
            </bitext-relation>
            <reuse-type xml:id="adaptationGeneral">
                <IRI>tag:textalign.net,2015:reuse-type:adaptation:general</IRI>
                <name>general adaptation</name>
            </reuse-type>
            <token-definition src="ring1881 ring1987" which="letters"/>
        </declarations>
        <agent xml:id="park" roles="creator">
            <IRI>tag:parkj@textalign.net,2015:self</IRI>
            <name xml:lang="eng">Jenny Park</name>
        </agent>
        <role xml:id="creator" which="creator"/>
        <change when="2015-01-20" who="park">Started file</change>
    </head>

The TAN-A-tok <head> looks similar to the previous examples, except that <declarations> has three children.

<bitext-relation> states through an IRI + name pattern the stemmatic relationship we think holds between the two sources. (Stemmatics is the study of the chain of transmission by which a single work eventually became the multiple copies, versions, and editions that are extant; it frequently involves the creation of genealogy-like trees to illustrate the work's version history.) We have used the entire IRI + name pattern, but we could have substituted it with @which and the value a/x+/b.

One or more <reuse-type>s specify how one text has reused another. The IRI we have used shows that we believe that the later text has generally adapted the earlier one. If this were a translation or a quotation or some other kind of text reuse, we might have used a different IRI.

A third declaration, <token-definition>, specifies how we have defined our word tokens. @src has more than one value, specifying that the same tokenization rule should be applied to both sources.

The value for @which, letters, is a reserved TAN keyword that defines a token as any consecutive string of word characters, ignoring spaces and punctuation. Under this token definition the phrase "Hush!" said he would have three tokens (Hush, said, and he). Had we set the value of @which to the reserved TAN keyword letters and punctuation, we would have six tokens, since each punctuation mark, including the two quotation marks, would be defined as a token.

<token-definition> is optional. If we leave it out, users are to assume that we mean letters. This is because most often, whenever in ordinary conversation we refer to the nth word in a sentence we assume people will skip punctuation marks in their counting.
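The two token counts given above can be reproduced with a short Python sketch. The regular expressions are approximations chosen for illustration, not the patterns TAN itself defines for these keywords, and the function name tokenize is hypothetical. The default argument mirrors the rule that letters is assumed when no <token-definition> is given.

```python
import re

# Approximations of the two keywords described above, not TAN's official patterns.
TOKEN_PATTERNS = {
    "letters": r"\w+",                        # runs of word characters only
    "letters and punctuation": r"\w+|[^\w\s]",  # the same, plus each punctuation mark
}


def tokenize(text, which="letters"):
    """Split text into tokens according to the named token definition."""
    return re.findall(TOKEN_PATTERNS[which], text)


phrase = '"Hush!" said he'
print(tokenize(phrase))                             # ['Hush', 'said', 'he']
print(tokenize(phrase, "letters and punctuation"))  # ['"', 'Hush', '!', '"', 'said', 'he']
```

Under the first definition the third token is he. Under the second it is the exclamation mark, which is why both sides of an alignment need to agree on the token definition before referring to the nth token.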
