Creating TAN metadata (<head>)

Creating TAN metadata (`<head>`)

Now that we have explored various IRI vocabularies for concepts related to our files concerning Ring-a-ring-a-roses, we can complete the metadata in our four TAN files. Let us start with the TAN-T file of the 1881 version:

<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
    id="tag:parkj@textalign.net,2015:ring01">
    <head>
        <name>TAN transcription of Ring a Ring o' Roses</name>
        <master-location 
            href="http://textalign.net/release/TAN-2020/examples/ring-o-roses.eng.1881.xml"/>
        <license licensor="park">
            <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
            <name>Attribution 4.0 International</name>
        </license>
        <work>
            <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
            <name>"Ring a Ring o' Roses" or "Ring Around the Rosie"</name>
        </work>
        <source>
            <IRI>http://lccn.loc.gov/12032709</IRI>
            <name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]</name>
        </source>
        <vocabulary-key>
            <person xml:id="park">
                <IRI>tag:parkj@textalign.net,2015:self</IRI>
                <name>Jenny Park</name>
            </person>
            <div-type xml:id="line">
                <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI>
                <name>line of poetry</name>
            </div-type>
            <role xml:id="creator">
                <IRI>http://schema.org/creator</IRI>
                <name xml:lang="eng">creator</name>
            </role>
        </vocabulary-key>
        <file-resp who="park"/>
        <resp roles="creator" who="park"/>
        <change when="2014-08-13" who="park">Started file</change>
        <to-do/>
    </head>
    . . . . . . .
</TAN-T>

<name>, the human-readable counterpart to the @id that is inside the root element, can be anything. We can also supply more than one <name>, in case we wish to provide alternative names for the file, or translations of them.
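
As a sketch of that last point, a <head> might carry an alternative name next to the primary one. The second <name> below is invented for illustration and is not part of the example file; the rest of <head> is left out.

```xml
<head>
    <!-- primary human-readable name, as in the example above -->
    <name>TAN transcription of Ring a Ring o' Roses</name>
    <!-- hypothetical alternative name, added only for illustration -->
    <name>Ring a Ring o' Roses, 1881 Greenaway version</name>
    <!-- remaining metadata omitted -->
</head>
```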

One or more <master-location>s provide URLs where master versions of the file are kept (and maintained). We provide this as a courtesy to others who might be using our data. Anyone who validates their local copy of the file will be warned if it does not match the master version, and they will be told of the most recent changes. With a couple of keystrokes, they can update their local copy to match the master. This one-way communication system lets us silently and conveniently notify other users of changes. We do not have to keep track of who is using our file, and users do not have to pester us with questions about what changed when.

<master-location> is mandatory only if we are finished with our to-do list, which is specified at <to-do>. If that element is empty, then we imply that we do not know anything further that should be done to the file. Conversely, any elements in <to-do> specify what remains to be done, and details will be returned to other users. That way you can release data that is useful but not completely perfect, and let users know about its deficiencies. This approach is ideal for formats such as TAN-A-tok, where you might have released only some of the data, and you are working on the rest.
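
A non-empty <to-do> might look like the following sketch. It assumes that outstanding tasks are written as <comment> children carrying @when and @who; that assumption, the date, and the wording of the task are illustrative and should be checked against the TAN schemas.

```xml
<to-do>
    <!-- assumed structure: one <comment> per outstanding task -->
    <!-- the date and the task described are hypothetical -->
    <comment when="2014-08-13" who="park">Proofread the transcription against the printed source.</comment>
</to-do>
```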

One day the link in <master-location> will be dead. But perhaps a copy of our file will be in circulation elsewhere. The document @id in the root element provides a way to identify files, independent of links, and perhaps locate them in unexpected places.

<license> specifies the license under which we are releasing our data. This element has nothing to do with the copyright of the source we have used (although, having been published in 1881, the book is clearly in the public domain). That is, we are specifying the rights attached to the data, not to its source: any additional strictures we have placed on the content in <body>. In this example, we have released the data under a Creative Commons license. The child element <IRI> specifies a Creative Commons IRI, and <name> is the human-readable form.

@licensor specifies who has granted the license, in this case our fictive Jenny Park (see below).

The conjunction of <IRI> and <name>, the IRI + name pattern, recurs throughout TAN files. It is used to provide identifiers for vocabulary items. In an element that takes the IRI + name pattern, we may include as many child <IRI>s or <name>s as we like. But if we do so, we are stating that they are synonymous, i.e., that they all name the same thing. (Once again, an IRI is unique, so it should never be used to identify more than one thing.)
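
For instance, the license in our file could carry two <IRI>s. The second one below is the tag URN that TAN's own license vocabulary assigns to the same license; giving both asserts that they identify one and the same thing. This is a sketch of the pattern, not a change to the example file.

```xml
<license licensor="park">
    <!-- two IRIs, declared synonymous: both name the same license -->
    <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
    <IRI>tag:textalign.net,2015:license:by/4.0/</IRI>
    <!-- human-readable form -->
    <name>Attribution 4.0 International</name>
</license>
```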

<work> uses the IRI + name pattern to name the work we have chosen to transcribe. <source> points, through its IRI + name pattern, to a computer- and human-readable description of the book we have chosen.

<vocabulary-key> contains vocabulary that we are using in our file. Inside, we can place more vocabulary items, and attach locally unique ids. For example, an IRI + name pattern is used for <person>, which identifies Jenny Park through a tag URN. The value of @xml:id allows us to use park any time we want to mention Jenny. In fact, we already have, at @licensor. Any mention of park will point to the appropriate item in <vocabulary-key>.

There are a few other parts of <vocabulary-key>. <div-type> specifies an IRI + name pattern for line divisions, and the value of @xml:id means that we can use line any time we want to invoke the concept. Similarly, we have a <role>. The <IRI> value of <role> comes from the vocabulary of schema.org, which is maintained by Bing, Google, and Yahoo! in conjunction with the W3C (the nonprofit organization dedicated to universal Internet standards), but we could have used Dublin Core or some other IRI vocabulary describing behaviors, responsibilities, and roles.

After the <vocabulary-key>, we get into parts of the file that specify who did what, when. First is a <file-resp>, whose value of @who, park, indicates that Jenny Park is the one primarily responsible for the file. <resp> specifies further who was responsible for doing what.^[6]

Remember that <head> is focused on the data, not its sources, so the claim that Jenny Park is the creator pertains only to the data. No inference should be made about who was responsible for the printed source. If someone wants to know anything about the book, they should pursue the IRI identifier we have provided under <source>.

<change> has attributes @when and @who to specify when the change was made and by whom. The value of @when is always a date or a date + time, formatted according to the ISO 8601 syntax: [YYYY]-[MM]-[DD] or [YYYY]-[MM]-[DD]T[HH]:[MM]:[SS]. @who always carries an IDref that points to a person or organization. <change> does not take the IRI + name pattern, or any children at all. It takes simply a plain-text description of what changed.
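
The two accepted forms of @when can be sketched side by side. The first <change> comes from the example file; the second, with its timestamp and description, is hypothetical.

```xml
<head>
    <!-- other metadata omitted -->
    <!-- date only: [YYYY]-[MM]-[DD] -->
    <change when="2014-08-13" who="park">Started file</change>
    <!-- date + time: [YYYY]-[MM]-[DD]T[HH]:[MM]:[SS] (hypothetical entry) -->
    <change when="2014-08-14T09:30:00" who="park">Corrected a typo in the second line</change>
</head>
```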

So now we have finished one transcription file's metadata. You may have found it to represent a lot of typing: many names, IRIs, and so forth. Is there any way to lighten that load? Yes, there is. TAN is a vocabulary-based format. That is, standard vocabulary items come with the TAN format, and you can design your own vocabulary, both to shorten the work involved and to adhere to the DRY (don't repeat yourself) principle.

Our second example will look similar to the first one, but notice some shortcuts:

<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
    id="tag:parkj@textalign.net,2015:ring02">
    <head>
      <name>TAN transcription of Ring around the Rosie</name>
      <master-location>ring-o-roses.eng.1987.xml</master-location>
      <license which="by 4.0" licensor="park"/>
      <work>
         <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
         <name>Ring around the Rosie</name>
      </work>
      <source>
         <IRI>http://lccn.loc.gov/87042504</IRI>
         <name>Mother Goose, from nursery to literature / by Gloria T. Delamar, 1987.</name>
      </source>
      <adjustments>
         <normalization which="no hyphens"/>
      </adjustments>
      <vocabulary-key>
         <div-type xml:id="l" which="line (verse)"/>
         <person xml:id="park" roles="creator">
            <IRI>tag:parkj@textalign.net,2015:self</IRI>
            <name xml:lang="eng">Jenny Park</name>
         </person>
      </vocabulary-key>
      <resp roles="creator" who="park"/>
      <change when="2014-10-24" who="park">Started file</change>
      <comment when="2014-10-24" who="park">See p. 39 of source.</comment>
      <to-do/>
   </head>
   . . . . . .
</TAN-T>

In this example, <name>, <master-location>, and <source> have been modified to describe this file. Note that <work> keeps the same <IRI>: both files transcribe versions of the same work, so only its <name> differs.

<license> looks different, but in reality it is identical to our previous example, and that is because the IRI + name pattern has been replaced with @which. You may replace any IRI + name pattern with @which; its value must match a <name> in customized or standard vocabulary (a TAN-voc file). In this case, "by 4.0" points to TAN's standard vocabulary for licenses (see the section called “TAN keywords for types of rights (<license>)”). Here is what that looks like under the hood:

<TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
   id="tag:textalign.net,2015:tan-voc:licenses">
    . . . . . . .
   <body affects-element="license">
      <item>
         <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
         <IRI>tag:textalign.net,2015:license:by/4.0/</IRI>
         <name>by 4.0</name>
         <desc>attribution 4.0 international</desc>
      </item>
    . . . . . . .
   </body>
</TAN-voc>

Because the validation rules for TAN-voc files require every <name> to be unique, that element can be treated as a unique identifier, similar to @xml:id. We could have repeated the <license> from the previous TAN-T file. But the @which method is much quicker, cleaner, and DRY.

Before <vocabulary-key> comes a new element, <adjustments>, which contains a <normalization> statement whose @which says no hyphens. That too points to a standard TAN vocabulary for normalizations: an IRI + name pattern for eliminating discretionary hyphens (see the section called “TAN keywords for types of normalizations (<normalization>)”). Here's what that vocabulary item looks like (invisible to you, but you can look at it any time you like in the vocabularies subdirectory of the TAN files):

<TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:textalign.net,2015:tan-voc:normalizations">
    . . . . . . .
   <body affects-element="normalization">
      <item>
         <IRI>tag:textalign.net,2015:normalization:hyphens-discretionary-removed</IRI>
         <name>no hyphens</name>
         <desc>Discretionary word-break line-end hyphens have been deleted.</desc>
      </item>
    . . . . . . .
   </body>
</TAN-voc>

As you might have inferred, the element <normalization> specifies how we have changed the data, namely, that we have opted to remove word-break line-end hyphenation. In other transcriptions we could use <normalization> to declare other kinds of changes we felt compelled to make, such as removing editorial comments or footnote signals. A healthy list of <normalization>s is a courtesy to users of our data, some of whom might passionately care about keeping or removing line-end hyphenation.

Back to our example. <div-type> has a new value for @xml:id, the letter l, and here too the IRI + name pattern has been replaced by @which, whose value, line (verse), is a standard vocabulary item (see the section called “TAN keywords for types of divisions (<div-type>)”).^[7]

There is also a new <comment> element, which is built much the same as <change>. (A <change>, after all, is just a comment about what has been changed.)

That seems to be all there is. But if you've been attentive, you will have noticed that the <role> from our first TAN-T file (inside <vocabulary-key>) is missing. That's because we don't need it, based on the same principle that lets us resolve @which. A vocabulary <name> can be invoked not only in @which, but in any attribute that points to values of @xml:id, in this case @roles. There is already a standard TAN vocabulary item with the <name> creator, so we can use it directly without having to declare an intermediate vocabulary item with an @xml:id. If we had defined something else in <vocabulary-key> with an @xml:id of creator, that item would take precedence and override the built-in TAN vocabulary. But we haven't, so the standard TAN vocabularies are the default.
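
To illustrate the override, here is a sketch in which a local <role> with an @xml:id of creator is declared. The IRI is the Dublin Core term for creator, but the item as a whole, including its <name>, is a hypothetical addition that appears in neither example file.

```xml
<vocabulary-key>
    <!-- hypothetical local definition: because its @xml:id is "creator",
         roles="creator" would resolve here, not to the standard TAN item -->
    <role xml:id="creator">
        <IRI>http://purl.org/dc/terms/creator</IRI>
        <name>creator (Dublin Core)</name>
    </role>
</vocabulary-key>
```

Removing this <role> would make roles="creator" fall back to the standard TAN vocabulary item of the same name.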

^[6]If you decide to modify someone else's TAN file, you should credit / blame yourself for the changes. Your first point of order should be to add a <person> to the <vocabulary-key>, identifying yourself. You can then either add a <change> (see below) or a <resp> (you might need to specify a <role> in the <vocabulary-key>). You should not change the document's @id, unless your changes are so significant that it becomes altogether a new document, your document. TAN does not try to broker the age-old problem of determining the point at which a thing becomes something altogether different (e.g., the Ship of Theseus problem). Use your best intuition.

^[7]A line of poetry is to be contrasted with a physical line on the page. Some lines of poetry take up two or more physical lines. For the physical line you would specify: which="line (physical)".
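
A file that needs both kinds of line could declare them side by side, as in the sketch below. The first <div-type> is taken from the second example file; the id pl for the physical line is a hypothetical choice.

```xml
<vocabulary-key>
    <!-- verse line, as in the second example file -->
    <div-type xml:id="l" which="line (verse)"/>
    <!-- physical line on the page; the id "pl" is hypothetical -->
    <div-type xml:id="pl" which="line (physical)"/>
</vocabulary-key>
```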