Chapter 2. Starting off with the TAN Format

If you are new to markup languages, or if you are unfamiliar with acronyms such as XML, RDF, XPath, or technical terms such as Unicode, you should start with this chapter, which uses a simple example to illustrate the steps typically taken to create and edit TAN files. By the end of this chapter, you will be able to create and edit a simple collection of TAN transcriptions and alignments. If you are familiar with basic markup concepts, you may wish to read through the chapter very quickly, or skip it altogether.

The discussion touches on a number of general concepts, some of which may be new. These concepts will be introduced only briefly. Further reading elsewhere will give you better grounding in a particular topic or technology.

Creating TAN Transcription and Alignment Data

Let us take a simple example, that of aligning two English versions of the nursery rhyme Ring-a-ring-a-roses, sometimes known as Ring around the Rosie. Our goal here is to publish two versions of the nursery rhyme in the TAN format so that they are most likely alignable with any other TAN version of the poem that someone might create.

We begin by finding previously published versions. In this case we have taken an interest in the versions published in 1881 and 1987 (the first in the UK, the second in the US). Each of these books has other rhymes, but we've already decided to focus upon the one particular nursery rhyme, so we transcribe those parts and nothing else:

Table 2.1. Ring around the Rosie

1881 (UK) version	1987 (US) version
Ring-a-ring-a-roses,	Ring-a-round the rosie,
A pocket full of posies;	A pocket full of posies,
Hush! Hush! Hush! Hush!	Ashes! Ashes!
We're all tumbled down.	We all fall down.

We must be sure to save each of the two transcriptions as plain Unicode text, preferably with .xml at the end of each file name. Do not bother with word processors (Word, OpenOffice, Google Docs, and so forth), because those programs are too sophisticated for our work and sometimes generate erroneous data, even when exporting to plain text. We will be working with raw text, and will not be concerned with italics, colors, fonts, margins, and so forth. Much better for our work is a text editor, which handles nothing but plain text. But even text editors are inadequate, because they do not check whether the rules of the format have been followed. So the best tool is an XML editor, which does the same thing a text editor does, but with shortcuts that save much typing and prevent syntax errors. More important, an XML editor will tell us when our TAN file is invalid, and will provide information and help as we work on our TAN files.

Note

Software suitable for your needs comes in many styles and prices. In addition to the links in the paragraph above, you may wish to visit the comparative lists for both text editors and XML editors. TAN was developed using oXygen, which is so powerful it may be very confusing to use at first. To avoid exasperation or despair, take advantage of tutorials and documentation associated with the XML editor you have chosen.

Our first task is to get these two versions into separate files with the appropriate markup. Each TAN transcription file has two major parts: a head and a body. For now, we focus only on the second part, the body, as well as a few of the necessary preliminary lines that stand above both the head and the body. First, the 1881 (UK) version:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring01">
    <head>
    . . . . . . .
    </head>
    <body xml:lang="eng" in-progress="false">
        <div type="line" n="1">Ring-a-ring-a-roses,</div>
        <div type="line" n="2">A pocket full of posies;</div>
        <div type="line" n="3">Hush! Hush! Hush! Hush!</div>
        <div type="line" n="4">We're all tumbled down.</div>
    </body>
</TAN-T>

And now the 1987 (US) version:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring02">
   <head>
   . . . . . . .
   </head>
   <body xml:lang="eng" in-progress="false">
      <div type="l" n="1">Ring-a-round the rosie,</div>
      <div type="l" n="2">A pocket full of posies,</div>
      <div type="l" n="3">Ashes! Ashes!</div>
      <div type="l" n="4">We all fall down.</div>
   </body>
</TAN-T>

These are standard eXtensible Markup Language (XML) files. (If you are already familiar with XML you may wish to skip ahead to the next section.) XML is rather simple. It provides a way to take a text or a collection of data and give it some structure through markup. In the examples above, the markup is everything enclosed in angle brackets.

Each file begins with a prolog, marked by the lines that begin with <?. The first line in the prolog simply states that what follows is an XML document. The next two lines point to the files that will be used to check to see whether or not our data is valid. For now we will skip the specific details of those first three lines, which will be identical, or nearly so, from one TAN file to the next. We can simply cut and paste those lines when we want to start a new one.

The fourth line is the opening tag of what is called the root element, here called <TAN-T>. That opening tag, <TAN-T...> is answered by a closing tag, </TAN-T>, the last line. The paired-tag relationship is true for all the other elements in this example. <head> is answered by </head>, <body> by </body> and each <div...> by </div>. These elements nest within or beside each other, but they never overlap. (The prohibition on overlapping elements is one of the cardinal rules of XML.) This relationship means that every XML file can be thought of as a tree, with the root at the trunk and the enveloped elements as branches, terminating in metaphorical leaves. It is helpful to use the tree metaphor when we describe the path we take, toward either the leaves or the root. In this manual, we may use the terms rootward and leafward when we want to trace movement within an XML document.
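The rule against overlapping elements can be observed with any XML parser. The following is a minimal sketch using Python's standard library; the helper name is_well_formed and the two short test strings are illustrative assumptions, not part of TAN.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Elements nested within and beside each other: acceptable.
nested = "<body><div>Ring-a-ring-a-roses,</div><div>A pocket full of posies;</div></body>"

# <div> opens inside <body> but closes after it: the elements overlap.
overlapping = "<body><div>Ring-a-ring-a-roses,</body></div>"

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```

A parser only confirms that a file is well-formed XML. Whether it is a valid TAN file is a separate question, answered by the validation files named in the prolog.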

An XML document is also profitably thought of as a family tree, a metaphor that provides commonly used terminology. In our examples above, <TAN-T> is the parent of <body>, and <body> the parent of the four <div> elements. Likewise, each <div> is the child of <body>, and <body> is the child of <TAN-T>. Distant parental relationships can be described with the terms ancestor and descendant. <TAN-T> is the ancestor of every element it encompasses, and every element encompassed by <TAN-T> is its descendant. Paratactic relationships are also important. <head> and <body> are siblings to each other, and every <div> is a sibling to every other <div>.
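The family relationships just described can be traced in code. This is a sketch only, using Python's standard library on a copy of the 1881 transcription; the empty <head/> is a stand-in for the suppressed head, and the helper local is a hypothetical convenience for stripping the namespace from element names.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the @xmlns value in the examples above.
NS = "{tag:textalign.net,2015:ns}"

# <head/> is a placeholder; the real head is discussed later in the guide.
doc = """<TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring01">
    <head/>
    <body xml:lang="eng" in-progress="false">
        <div type="line" n="1">Ring-a-ring-a-roses,</div>
        <div type="line" n="2">A pocket full of posies;</div>
        <div type="line" n="3">Hush! Hush! Hush! Hush!</div>
        <div type="line" n="4">We're all tumbled down.</div>
    </body>
</TAN-T>"""

def local(element):
    """Return the element name without its namespace prefix."""
    return element.tag.replace(NS, "")

root = ET.fromstring(doc)

# Children of the root: <head> and <body>, which are siblings of each other.
children_of_root = [local(child) for child in root]

# Children of <body>: the four sibling <div> elements.
body = root.find(NS + "body")
divs = list(body)

# Descendants of the root: every element it encompasses (the root itself is skipped).
descendants = [local(e) for e in root.iter()][1:]

print(children_of_root)  # ['head', 'body']
print(len(divs))         # 4
print(descendants)       # ['head', 'body', 'div', 'div', 'div', 'div']
```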

Inside of the opening tags for the <TAN-T>, <body>, and <div> elements are pairs of text joined by an equals sign, collectively called an attribute. The left side of the equals sign is the attribute name, and on the right side, within the quotation marks, is the attribute value. <TAN-T> has two attributes, @xmlns and @id (when we discuss an attribute outside its original context, we often preface the name with @). We will skip @xmlns for now; this attribute (actually, a pseudo-attribute) specifies the namespace of the XML file, a somewhat advanced topic.
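To see why @xmlns is called a pseudo-attribute, it can help to look at how a parser reports attributes. The sketch below uses Python's standard library on an abbreviated copy of the first example (the head is omitted for brevity). In this particular library, @xmlns does not appear among the attributes at all; it is folded into the element name instead.

```python
import xml.etree.ElementTree as ET

# Namespace that XML reserves for the xml: prefix (used by @xml:lang).
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

root = ET.fromstring(
    '<TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring01">'
    '<body xml:lang="eng" in-progress="false">'
    '<div type="line" n="1">Ring-a-ring-a-roses,</div>'
    '</body></TAN-T>'
)
body = root[0]
div = body[0]

# Only @id is reported as an attribute of the root element.
print(root.attrib)  # {'id': 'tag:parkj@textalign.net,2015:ring01'}

# The namespace has become part of the element name.
print(root.tag)     # {tag:textalign.net,2015:ns}TAN-T

# Attribute names are on the left of the equals sign, values on the right.
print(body.get(XML_NS + "lang"), body.get("in-progress"))  # eng false
print(div.get("type"), div.get("n"))                       # line 1
```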

The value of @id, however, is quite important and our first item of business. Every TAN file has an @id that uniquely and permanently identifies the file itself. It is quite similar to the name we give a file when we save it, and to the names we see when we browse the local contents of our computer, except that it should not be changed from one revision to the next. When we want to record changes to our file, we will not alter the @id value, but simply note the change elsewhere in the document (see below).

The value of @id is always what is called a tag uniform resource name (tag URN). It always starts with tag:, followed by an email address or domain name that we own or owned. (It is okay to use an obsolete address.) After that email address or domain name comes a comma (no spaces) and a date on which we owned it, in the international standard format of year, month, and day, joined by hyphens, e.g., 2014-12-31. If we leave off the day value, it is assumed to be the first of the month; if we leave off the month value it is assumed to be January. In the examples above, [USER@DOMAIN.NET],2015 indicates that the email address was owned on the stroke of midnight (Coordinated Universal Time) January 1, 2015. After that comes a colon, and then any name we wish to assign to the file.

We have anticipated a simple collection of texts, so we've called the files ring01 and ring02. (If we run out of names, or want to restart, we can simply use a new email-date preface, e.g., [USER@DOMAIN.NET],2014-01-02.)
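The parts of an @id value can be pulled apart mechanically. What follows is a simplified sketch in Python, not a complete implementation of the tag scheme (RFC 4151 gives the full grammar); the function name parse_tag_urn and the pattern are assumptions made for illustration, and the second value tested, built on example.org, is hypothetical.

```python
import re
from datetime import date

# Simplified pattern: tag: + owner + comma + date + colon + name.
TAG_URN = re.compile(
    r"^tag:(?P<authority>[^,:]+),"
    r"(?P<year>\d{4})(?:-(?P<month>\d{2}))?(?:-(?P<day>\d{2}))?"
    r":(?P<name>.+)$"
)

def parse_tag_urn(value):
    """Split an @id value into (owner, date of ownership, file name)."""
    m = TAG_URN.match(value)
    if m is None:
        raise ValueError(f"not a tag URN: {value!r}")
    # A missing month defaults to January, a missing day to the first.
    owned_on = date(int(m["year"]), int(m["month"] or 1), int(m["day"] or 1))
    return m["authority"], owned_on, m["name"]

print(parse_tag_urn("tag:parkj@textalign.net,2015:ring01"))
# ('parkj@textalign.net', datetime.date(2015, 1, 1), 'ring01')

print(parse_tag_urn("tag:example.org,2014-12-31:demo"))
# ('example.org', datetime.date(2014, 12, 31), 'demo')
```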

The element <body> contains our transcription. @xml:lang, required, specifies the principal language of the transcribed text. We use the standard 3-letter abbreviation for English. (See later in the guide for more complex language requirements.) By saying that @in-progress is false, we indicate that we have finished our transcription and have no further plans to develop it. It doesn't mean that the file is free of errors. We can make corrections later. It just means that we have no more revisions planned, and any further changes will be restricted to corrections of errors. This attribute is optional. If it is left off, our TAN file is assumed to be a work in progress, and it serves as a kind of warning to anyone who might want to use it.

Our transcription has been divided into four <div> elements. How we divide up the work is entirely up to us. But we must make sure that every bit of text is enclosed by a leafmost <div>. That is, every <div> must be the parent of only other <div>s, or none at all. We cannot have a <div> that mixes text with other elements (such as other <div>s). The values of @type and @n indicate, respectively, the type of division and the name of the division. We have used line in the first example, but we could easily have also used l (as we did in the second) or ln or any other phrase that we think will make intuitive sense to other users. The choice is arbitrary (we will see why below). We have used arabic numerals for the values of @n, but the value, once again, could have been anything. We could have used Roman numerals, or some other naming scheme that is standard in the field.
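The rule that a <div> may not mix text with other <div>s is normally enforced by the validation files, but it can be illustrated with a short check. The sketch below is an assumption-laden illustration in Python: the function mixed_divs is hypothetical, namespaces are left out for brevity, and the stanza type in the test data is invented for the example.

```python
import xml.etree.ElementTree as ET

def mixed_divs(body):
    """Return the @n values of <div>s that hold both text and child elements."""
    offenders = []
    for div in body.iter("div"):
        has_children = len(div) > 0
        # Text directly inside this <div>: before its first child or after any child.
        own_text = (div.text or "").strip() + "".join(
            (child.tail or "").strip() for child in div
        )
        if has_children and own_text:
            offenders.append(div.get("n"))
    return offenders

# Acceptable: the outer <div> contains only <div>s; text sits in the leafmost ones.
good = ET.fromstring(
    '<body><div type="stanza" n="1">'
    '<div type="line" n="1">Ring-a-ring-a-roses,</div>'
    '<div type="line" n="2">A pocket full of posies;</div>'
    '</div></body>'
)

# Not acceptable: the outer <div> mixes raw text with a child <div>.
bad = ET.fromstring(
    '<body><div type="stanza" n="1">Ring-a-ring-a-roses,'
    '<div type="line" n="2">A pocket full of posies;</div>'
    '</div></body>'
)

print(mixed_divs(good))  # []
print(mixed_divs(bad))   # ['1']
```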

Aside from the <head> element (discussed later), that's all we need in the transcription. We can now move to alignment.

There are two different types of alignment, one emphasizing breadth, the other, depth. The broad type of alignment, called TAN-A-div, allows us to specify TAN transcriptions of as many versions of as many works as we wish, and to fine-tune the alignment upon the basis of the <div> elements within the transcription. We do not specify why we wish to align the versions. We only declare our interest in doing so. The other type of alignment, emphasizing depth, is called TAN-A-tok and allows us to take any two (and no more) TAN transcriptions, create word-to-word (or better put, token-to-token) relationships, and specify what type of relationship holds between each set of aligned words. TAN-A-div is suitable for work that focuses on the general alignment of multiple versions of one or more works at a single time. TAN-A-tok is for highly detailed, precise alignment of two text versions.

For our example, we start with a TAN-A-div file (once again suppressing <head>):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-div.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-div.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-A-div xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring-alignment">
    <head>
    . . . . . . .
    </head>
    <body/>
</TAN-A-div>

In the prolog, the first line is identical to the first line of our transcription files. The second and third lines are identical, aside from pointing to the validation files for alignment. Even the fourth line looks like the transcription file, other than the new name for the root element, <TAN-A-div>, and the new value for @id.

The penultimate line, <body/>, is what is called an empty element, and is equivalent to <body></body>. Collapsing the opening and the closing tags of the element into a single tag provides a shorthand syntax for elements that contain nothing. It will become apparent, when we discuss <head> below, why our <body> can be empty.
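That the two spellings of an empty element are equivalent can be confirmed with a parser. A brief sketch in Python's standard library, with variable names chosen only for illustration:

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<body/>")
long_form = ET.fromstring("<body></body>")

# Same name, no children, no text: the parser sees no difference.
same = (
    short_form.tag == long_form.tag
    and len(short_form) == len(long_form) == 0
    and (short_form.text or "") == (long_form.text or "")
)
print(same)  # True
```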

The other kind of alignment, TAN-A-tok, takes a bit more work, because we must first identify words that correspond with each other. Even before we do that, we need to decide what kind of relationship holds between the two texts. Let us pretend, for the sake of example, that the 1987 version is a direct descendant (and therefore a variation) of the 1881 one. So our task is to show exactly what parts of the older version correspond to those of the newer one. We will simplify in this case, and assume an interest only in words, ignoring spaces and punctuation. We will also adopt the term token instead of word (word is notoriously difficult to define, and has connotations lacking from token).

We now create a TAN-A-tok file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-tok.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-tok.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-A-tok xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:TAN-A-tok,ring01+ring02">
    <head>
    . . . . . . .
    </head>
    <body bitext-relation="B-descends-from-A" reuse-type="adaptation" in-progress="false">
        <!-- Examples of picking tokens by number -->
        <align>
            <tok src="ring1881" ref="1" ord="1"/>
            <tok src="ring1987" ref="1" ord="1"/>
        </align>
        <align>
            <tok src="ring1881" ref="1" ord="2"/>
            <tok src="ring1987" ref="1" ord="2"/>
        </align>
        <align>
            <tok src="ring1881" ref="1" ord="3"/>
            <tok src="ring1987" ref="1" ord="3"/>
        </align>
        <align>
            <tok src="ring1881" ref="1" ord="4"/>
            <tok src="ring1987" ref="1" ord="4"/>
        </align>
        <align>
            <tok src="ring1881" ref="1" ord="5"/>
            <tok src="ring1987" ref="1" ord="5"/>
        </align>
        <!-- Examples of picking tokens by value -->
        <align>
            <tok src="ring1881" ref="2" val="A"/>
            <tok src="ring1987" ref="2" val="A"/>
        </align>
        <align>
            <tok src="ring1881" ref="2" val="pocket"/>
            <tok src="ring1987" ref="2" val="pocket"/>
        </align>
        <align>
            <tok src="ring1881" ref="2" val="full"/>
            <tok src="ring1987" ref="2" val="full"/>
        </align>
        <align>
            <tok src="ring1881" ref="2" val="of"/>
            <tok src="ring1987" ref="2" val="of"/>
        </align>
        <align>
            <tok src="ring1881" ref="2" val="posies"/>
            <tok src="ring1987" ref="2" val="posies"/>
        </align>
        <!-- Examples of picking ranges of tokens -->
        <align>
            <tok src="ring1881" ref="3" ord="1, 2"/>
            <tok src="ring1987" ref="3" ord="1"/>
        </align>
        <align>
            <tok src="ring1881" ref="3" ord="3 - 4"/>
            <tok src="ring1987" ref="3" ord="2"/>
        </align>
        <align>
            <tok src="ring1881" ref="4" ord="1"/>
            <tok src="ring1987" ref="4" ord="1"/>
        </align>
        <align>
            <tok src="ring1881" ref="4" ord="2"/>
        </align>
        <align>
            <tok src="ring1881" ref="4" ord="3"/>
            <tok src="ring1987" ref="4" ord="2"/>
        </align>
        <!-- examples of using "last" -->
        <align>
            <tok src="ring1881" ref="4" ord="last-1"/>
            <tok src="ring1987" ref="4" ord="last-1"/>
        </align>
        <align>
            <tok src="ring1881" ref="4" ord="last"/>
            <tok src="ring1987" ref="4" ord="last"/>
        </align>
    </body>
</TAN-A-tok>

Once again, the first four lines, the prolog and root element, should look familiar, with the only significant changes being the names of the validation files, the name of the root element (<TAN-A-tok>) and the value of @id.

The heart of the data is <body>, which has, in addition to @in-progress, two more attributes, @reuse-type, which specifies the default type of relationship between the two sources, and @bitext-relation, which specifies how the versions relate to each other. Our two values, B-descends-from-A and adaptation, are arbitrary names that we define in the <head> (discussed later).

<body> is the parent of one or more <align> elements, each of which correlates a set of tokens in the two texts through its <tok> children. Each <tok> has, in this example, three attributes. @src takes a nickname (an @id reference) that points to one of the two transcriptions; we have used ring1881 and ring1987 but we could have just as easily used anything else such as uk and us. @ref has a value that points to a specific <div> in the source transcription. Finally, @ord or @val specifies which token is intended, either by word number (@ord) or by the text of the actual word (@val). Either technique is fine, and the two can be mixed, as in the example. You may also notice that the comma and hyphen can be used in @ord to point to multiple words within the same <div>, and that last and last-X (where X is a digit) can be used to point to a word token relative to the last one in a <div>.
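To make the token references in the example concrete, here is a rough sketch in Python of how values such as 1, 2 or 3 - 4 or last-1 might be resolved against the text of a <div>. Everything in it is an assumption made for illustration: the tokenizing rule (a token is any run of letters or digits, so that We're yields We and re), the function names, and the choice to return every match when picking by value. The real rules are defined by TAN itself, not by this sketch.

```python
import re

# One position: a number, "last", or "last-N".
TERM = r"(?:last(?:-\d+)?|\d+)"

def tokenize(text):
    """Split text into tokens: runs of letters or digits (an assumed rule)."""
    return re.findall(r"\w+", text)

def _position(term, count):
    """Turn one term into a 1-based position within count tokens."""
    if term.startswith("last"):
        offset = int(term[5:]) if len(term) > 4 else 0
        return count - offset
    return int(term)

def resolve_ord(expr, count):
    """Expand a value such as "1, 2", "3 - 4" or "last-1" into 1-based positions."""
    positions = []
    for part in expr.split(","):
        part = part.strip()
        if re.fullmatch(TERM, part):          # a single position
            positions.append(_position(part, count))
            continue
        m = re.fullmatch(rf"({TERM})\s*-\s*({TERM})", part)  # a range
        if m is None:
            raise ValueError(f"cannot read position value: {part!r}")
        start = _position(m.group(1), count)
        end = _position(m.group(2), count)
        positions.extend(range(start, end + 1))
    return positions

def pick_tokens(text, ord_expr=None, val=None):
    """Pick tokens from a line of text by position or by value."""
    tokens = tokenize(text)
    if val is not None:
        return [t for t in tokens if t == val]
    return [tokens[p - 1] for p in resolve_ord(ord_expr, len(tokens))]

print(tokenize("We're all tumbled down."))                     # ['We', 're', 'all', 'tumbled', 'down']
print(pick_tokens("Hush! Hush! Hush! Hush!", ord_expr="3 - 4"))  # ['Hush', 'Hush']
print(pick_tokens("We're all tumbled down.", ord_expr="last-1")) # ['tumbled']
print(pick_tokens("A pocket full of posies;", val="pocket"))     # ['pocket']
```

Under this tokenizing assumption, line 4 of the 1881 version has five tokens and line 4 of the 1987 version has four, which is why last-1 pairs tumbled with fall in the example.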

Each <align> can establish one-to-one, one-to-many, many-to-one, or many-to-many relationships between the two texts. Words may feature in multiple <align> elements (that is, overlapping is permissible). And if an <align> has <tok> elements belonging to only one source, such as in the fourth-to-last <align> above, we have what is called, in these guidelines, a half-null alignment. This half-null alignment indicates that the second word of line four of the 1881 version is excluded from the act that we have called adaptation (which is, as we shall see, defined in the <head>). If this were a translation, it would be as if we were saying that this word was excluded from the translation. (A half-null alignment containing only tokens of the later source might point to words that the translator added.)
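Half-null alignments are easy to find mechanically, since they are simply <align> elements whose <tok> children all share one source. The sketch below, in Python, works on three <align> elements copied from the example (namespace omitted for brevity); the function name half_null is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

body = ET.fromstring("""<body>
    <align>
        <tok src="ring1881" ref="4" ord="1"/>
        <tok src="ring1987" ref="4" ord="1"/>
    </align>
    <align>
        <tok src="ring1881" ref="4" ord="2"/>
    </align>
    <align>
        <tok src="ring1881" ref="4" ord="3"/>
        <tok src="ring1987" ref="4" ord="2"/>
    </align>
</body>""")

def half_null(body):
    """Return the <tok> references of every <align> that draws on only one source."""
    found = []
    for align in body.iter("align"):
        toks = list(align.iter("tok"))
        sources = {tok.get("src") for tok in toks}
        if len(sources) == 1:
            found.append([(t.get("src"), t.get("ref"), t.get("ord")) for t in toks])
    return found

print(half_null(body))  # [[('ring1881', '4', '2')]]
```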

A half-null alignment should not be confused with our own silence. As creators of this file, we are under no obligation to indicate every word-for-word correspondence. If we fail to mention certain words, all that can be implied is that we opted not to say anything about them.

We could have aligned the two texts in different ways. Perhaps further study will reveal that we were in error to associate the second "ring" with "round" in line 1. We can make corrections, even after publication, and signal the change to users of our data. There are also ways to express doubt or alternative opinions. We can even correlate fragments of tokens (letters, prefixes, infixes, or suffixes). All these more advanced uses are discussed in the detailed parts of these guidelines.
