Chapter 2. Starting off with the TAN format


If you think you are ready to jump in and get going, try the section called “Installation and local setup”. But if you are new to markup languages, or unfamiliar or uncomfortable with acronyms and technical terms such as XML, RDF, XPath, and Unicode, you should start with this chapter, which uses a simple example to illustrate the steps typically taken to create and edit TAN files, and to introduce new terminology. By the end of this chapter, you will have a sense of how to create and edit a small collection of TAN transcriptions and alignments.^[1]

The chapter touches on a number of general concepts that are discussed only briefly. If you find a particular term new or confusing, follow the prompts for further reading. If you are already familiar with basic markup concepts, you should at least skim through the chapter, because TAN approaches some old problems in new ways.

Creating TAN transcription and alignment data

Let us take a simple example, that of aligning two English versions of the nursery rhyme Ring-a-ring-a-roses, sometimes known as Ring around the Rosie. Our goal here is to publish two versions of the nursery rhyme in the TAN format so that they are most likely alignable with any other TAN version of the poem that might appear.^[2]

We begin by finding previously published versions that haven't been digitized. In this case we have taken an interest in the versions published in 1881 and 1987 (one published in the U.K., the other in the U.S.). Each of these books has other rhymes, but we've decided to focus upon one nursery rhyme, so we type up (transcribe) that poem and nothing else:

Table 2.1. Ring around the Rosie

1881 (U.K.) version	1987 (U.S.) version
Ring-a-ring-a-roses,	Ring-a-round the rosie,
A pocket full of posies;	A pocket full of posies,
Hush! Hush! Hush! Hush!	Ashes! Ashes!
We're all tumbled down.	We all fall down.

We must be sure to save each of the two transcriptions as plain text. Do not bother with a word processor (Word, OpenOffice, Google Docs, and so forth), which is too fancy for our needs. Word processors sometimes generate erroneous data, even when you export to plain text. And we are not concerned with italics, colors, fonts, margins, and so forth. We would be better off with a text editor, which opens and saves only text. But even those do not check to see if the rules of the TAN format have been followed. So the best tool is an XML editor, which like a text editor takes and creates only text. An XML editor is designed to follow the rules of XML, and so saves a lot of typing, and prevents many errors. More important, an XML editor will tell us when our TAN file is invalid, and will provide important help as we edit.^[3]

Our first task is to get these two versions into separate files with the appropriate markup. Each TAN transcription file has two major parts: a head and a body. For now, we focus on only the second part, the body, as well as a few of the necessary preliminary lines that stand at the opening of the file, before both the head and the body. First, the 1881 (U.K.) version:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" 
    type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.sch" 
    type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
    id="tag:parkj@textalign.net,2015:ring01">
    <head>
    . . . . . . .
    </head>
    <body xml:lang="eng">
        <div type="line" n="1">Ring-a-ring-a-roses,</div>
        <div type="line" n="2">A pocket full of posies;</div>
        <div type="line" n="3">Hush! Hush! Hush! Hush!</div>
        <div type="line" n="4">We're all tumbled down.</div>
    </body>
</TAN-T>

And now the 1987 (U.S.) version:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" 
   type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN.sch" 
   type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
   id="tag:parkj@textalign.net,2015:ring02">
   <head>
   . . . . . . .
   </head>
   <body xml:lang="eng">
      <div type="l" n="1">Ring-a-round the rosie,</div>
      <div type="l" n="2">A pocket full of posies,</div>
      <div type="l" n="3">Ashes! Ashes!</div>
      <div type="l" n="4">We all fall down.</div>
   </body>
</TAN-T>

The examples above are in eXtensible Markup Language (XML). XML lets you take a text or a collection of data and structure it with angle brackets, < and >. In the examples above, such markup is in boldface.

Each file begins with a prolog, the first few lines that begin with <?. The first line simply states that what follows is an XML document. The next two items in each example, each wrapped across two lines, are processing instructions that point to the schemas: files that will be used to check whether or not our XML follows TAN rules, a process called validation. We will skip the details of those first five lines. They will be identical, or nearly so, from one TAN file to the next. We can simply cut and paste them when we want to start a new TAN file.

After the prolog comes an opening tag, signified by an angle bracket followed by a letter, here <TAN-T>. That opening tag, <TAN-T...>, is answered by a closing tag, </TAN-T>, on the last line. An opening tag and a closing tag mark the beginning and the end of one of the most important parts of an XML document, the element. For now, you can think of an element as a chunk of data. Every element is marked by a pair of tags. In this example, <head> is answered by </head>, <body> by </body>, and each <div...> by </div>. Any element that has an opening tag must have a closing tag. If an element doesn't have anything between its opening and closing tags, the two of them can be collapsed into a single tag. That is, <a></a> can be simplified to <a/> (such empty elements are illustrated below).
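As a minimal sketch of the tag pairing just described, the following fragment uses made-up element names (poem, title, and note are illustrative only and are not part of TAN). It shows an element with content, an element with nothing between its tags, and the collapsed form of the latter.

```xml
<!-- Hypothetical element names, for illustration only; these are not TAN elements. -->
<poem>
    <!-- An opening tag, some text, and the matching closing tag -->
    <title>Ring around the Rosie</title>
    <!-- An element with nothing inside it, written the long way -->
    <note></note>
    <!-- The same empty element, collapsed into a single tag -->
    <note/>
</poem>
```

An XML processor treats the two forms of the empty note element as equivalent, so the choice between them is a matter of taste.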

Elements and processing instructions are two of the seven basic XML ingredients, called nodes. The other five node types are text, comment, attribute, namespace, and document, some of which we will meet below. The element node is arguably the most important type. You will see it most often, and it is absolutely required for anything to be well-formed XML. Every XML file must have at least one element. (But it does not have to have attributes, text, comments, or processing instructions.)

Elements nest within or beside each other, but they never overlap or interlock. That is, you cannot have <a><b>overlap</a></b>. The prohibition on overlapping elements is one of the cardinal rules of XML. The no-overlap rule keeps XML files tidy, and makes it easier for developers to write efficient applications.
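The no-overlap rule can be sketched as follows. The element names a, b, and example are placeholders, not TAN elements. The overlapping form is quoted inside a comment so that the fragment as a whole stays well-formed.

```xml
<!-- Not well-formed: b opens inside a but closes outside it.
     <a><b>overlap</a></b>
-->
<example>
    <!-- Well-formed: b nests entirely within a -->
    <a><b>nested</b></a>
    <!-- Well-formed: a and b sit beside each other -->
    <a>beside</a><b>beside</b>
</example>
```

An XML editor should reject the overlapping form outright, before any TAN-specific checks are even attempted.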

Any two nearby elements normally relate to each other either by one nesting inside the other or by one being adjacent to the other. Because of these different close relationships, every XML file can be thought of as a tree, with the root at the trunk and the nested elements as branches, terminating in metaphorical leaves: those elements that do not contain any other elements. It is helpful to use the tree metaphor when we describe the path we take, toward either the leaves or the root. In these guidelines, we may use the terms rootward and leafward when we want to trace movement up and down the levels of hierarchy in an XML document. You may also encounter the corresponding terms outermost and innermost. The metaphor is strengthened by the XML rule that there can be only one root element, i.e., the element that contains all other elements and is contained by none. In our examples above the root element is named TAN-T.

An XML document tree can also be profitably thought of as a family. Family names provide the most common terminology to describe the relationship between elements. In our examples above, <TAN-T> is the parent of <body>, and <body> is the parent of the four <div> elements. Likewise, each <div> is the child of <body>, and <body> is the child of <TAN-T>. Distant parental relationships can be described with the terms ancestor and descendant. <TAN-T> is the ancestor of every element it encompasses, and every element encompassed by <TAN-T> is its descendant. Paratactic relationships are also important. <head> and <body> are siblings to each other, and every <div> is a sibling to every other <div>. The terms "following" and "preceding" are the most common ways to describe the relationship of one sibling to another.
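The family terms can be read off a stripped-down outline of our transcription file. This is only a sketch: the attributes and the contents of <head> are left out, so it is not a valid TAN-T file, and the text inside each <div> is a label rather than a transcription.

```xml
<!-- Outline only: attributes and the contents of head are omitted. -->
<TAN-T>
    <!-- head and body are children of TAN-T and siblings of each other -->
    <head/>
    <body>
        <!-- each div is a child of body and a descendant of TAN-T -->
        <div>first sibling; it precedes the next div</div>
        <div>second sibling; it follows the previous div</div>
    </body>
</TAN-T>
```

Reading from the inside out (leafward to rootward), each <div> has <body> as its parent and <TAN-T> as a more distant ancestor.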

You may notice that some characters are inside opening tags, but not closing ones. In the opening tags for the <TAN-T>, <body>, and <div> elements there appear sets of pairs: a word and something within quotation marks, each of them separated by an equals sign. These stretches of text are called attributes. On the left side of the equals sign is the attribute name, and on the right side, within the quotation marks, is the attribute value. In the example above <TAN-T> has three attributes, @xmlns, @TAN-version, and @id (it is customary to signal attributes by writing @). We will skip @xmlns for now. It looks like an attribute, but it's really a pseudo-attribute, because it specifies the namespace of the XML file. Namespaces are an important but advanced topic, not discussed in this chapter. (See the section called “Namespaces”.)

The value of @TAN-version indicates that the 2021 version of TAN is being used.

@id is quite important. Every TAN file has an @id that uniquely names and permanently identifies the document itself. It should not be changed, even if we make edits. If you change the filename or a copy of it winds up being incorporated into another project, a stable @id will be quite important for finding it. An @id should be unique. The only time the value should be repeated in a file is when you are pointing to another version of the same file.

In the <TAN-T>, the value of @id must always be what is called a tag uniform resource name (tag URN). A tag URN begins with tag:, followed by an email address or domain name that we own or owned. It is okay to use an obsolete address or domain; its purpose is to allow users to identify you, perhaps centuries from now, not to contact you. But you can use a current email address if you want to be contacted by those who use your file. After that email address or domain name comes a comma (no spaces) and a date on which we owned it, in the form of numbers for the year, year + month, or year + month + day, each item joined by hyphens, e.g., 2014-12-31. If we leave off a day value, it is assumed to be 01, the first of the month; if we leave off the month value it is assumed to be 01, January.

In the examples above, parkj@textalign.net,2015 points to our fictive self, Jenny Park, who owned that particular email address on the stroke of midnight (Coordinated Universal Time) January 1, 2015. After that comes a colon, and then any name we wish to assign to the file.

We have anticipated a simple collection of texts, so we've called the files ring01 and ring02. If we run out of names, or want to restart, we can simply use a new email-date preface, e.g., parkj@textalign.net,2015-01-02. Or we could change the way we build our tag URNs.
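The parts of a tag URN can be laid out as in the sketch below, which reuses the values from this chapter. The root element is shown without its <head> and <body>, so it illustrates only the shape of @id and is not a valid TAN file.

```xml
<!-- Anatomy of the id used in the first transcription:
       tag:                   fixed prefix
       parkj@textalign.net    email address (or domain name) we own or owned
       ,2015                  date of ownership; equivalent to 2015-01-01
       :ring01                any name we wish to assign to the file
-->
<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021"
    id="tag:parkj@textalign.net,2015-01-02:ring01">
    <!-- head and body omitted in this sketch -->
</TAN-T>
```

Here the date has been changed to 2015-01-02, the alternative preface mentioned above, so this @id names a different file from the ring01 of the first example. The value of @xmlns in the same tag has the same overall shape, built on a domain name rather than an email address.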

Tag URNs are very useful. You do not need permission to create one. You don't need to register them. You are in control. You also signal who is responsible for the file. Hundreds of years from now, when that email address is defunct or perhaps owned by someone else, users might still be able to identify who was responsible.

The element <body> contains our transcription. @xml:lang, required, specifies the principal language of the transcribed text. We use the standard 3-letter abbreviation for English. We could have used en, but the 2-letter convention supports only a handful of languages. (See the section called “Languages” for more.)

Our transcription has been divided into four <div> elements. How we divide up the work is entirely up to us. But we must make sure that every bit of text is enclosed by a leaf <div> (i.e., one that contains no other <div>). Every <div> must be the parent of only other <div>s, or none at all. No <div> may mix text and other elements. An exception is made for text that is nothing but space (the space bar, the tab, or the new line). Space-only text can be mixed with elements as needed, which means that a TAN file can be indented however you like, without changing its meaning.
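To illustrate the rule about leaf divisions, here is a sketch of how the 1881 version might look if we wrapped the four lines in a larger division. The type name stanza is an assumption made for this example; like line, it would need to be explained in the <head>.

```xml
<body xml:lang="eng">
    <!-- A div that contains only other divs (and space-only text) -->
    <div type="stanza" n="1">
        <!-- Leaf divs: each contains text and no other div -->
        <div type="line" n="1">Ring-a-ring-a-roses,</div>
        <div type="line" n="2">A pocket full of posies;</div>
        <div type="line" n="3">Hush! Hush! Hush! Hush!</div>
        <div type="line" n="4">We're all tumbled down.</div>
    </div>
    <!-- Not allowed, because text and a child div are mixed:
         <div type="stanza" n="1">Ring-a-ring-a-roses,<div type="line" n="2">A pocket full of posies;</div></div>
    -->
</body>
```

The indentation and line breaks between the <div> elements are space-only text, so they can be changed freely without affecting the transcription.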

The values of @type and @n indicate, respectively, the type of division and the name of the division. We have used line in the first example, but we could easily have also used l (as we did in the second) or ln or any other phrase that we think will be intuitive to other users. The value is arbitrary, but gets explained by what is in the header (we will see how below). We have used Arabic numerals for the values of @n, but the value, once again, could have been anything. Here we've opted for a reference system that seems intuitive and will most likely apply to multiple versions of the work. But the Arabic numerals are not required. We could have used Roman numerals, or some other numbering or naming scheme that is standard in the field. The idea is to use the term that is most like what other people encoding a different version of the same text might use.
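As a sketch of one such alternative, the same four lines could be labeled with the type ln and lowercase Roman numerals. Both choices are hypothetical here, and the exact form of Roman numerals that TAN accepts should be checked against the rest of the guidelines.

```xml
<body xml:lang="eng">
    <!-- Same text as before; only the labels in type and n have changed -->
    <div type="ln" n="i">Ring-a-ring-a-roses,</div>
    <div type="ln" n="ii">A pocket full of posies;</div>
    <div type="ln" n="iii">Hush! Hush! Hush! Hush!</div>
    <div type="ln" n="iv">We're all tumbled down.</div>
</body>
```

Whichever scheme we choose, it helps alignment if it matches what editors of other versions of the rhyme are likely to use.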

Aside from the <head> element (discussed below), that's all we need in the TAN-T transcription. We can now move to alignment and annotation.

We now turn to a second TAN format, TAN-A. Whereas the first two examples, TAN-T, had to do with texts and transcriptions, TAN-A has to do with alignment and annotation. The TAN-A format allows us to align and annotate as many transcriptions as we wish, and to make claims about them. Let's begin, once again temporarily skipping <head>. Significant differences from the previous two TAN-T files are emphasized:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-A.rnc" 
    type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN.sch" 
    type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-A xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
    id="tag:parkj@textalign.net,2015:ring-alignment">
    <head>
    . . . . . . .
    </head>
    <body/>
</TAN-A>

In the prolog, the first line is identical to the first line of our transcription files. The second and third lines, the processing instructions, are identical, except that href of the first of these points to the validation file specific to the TAN-A format. Even the fourth line looks like the two TAN-T files, other than the new name for the root element, <TAN-A>, and the new value for @id.

The penultimate line, <body/>, is an empty element, and is equivalent to an opening tag immediately followed by a closing tag, i.e., <body></body>. The alternative form, <body/>, is a more succinct way to say that an element contains nothing. It will become apparent, when we discuss <head> below, why our <body> can be empty.

Let's look at a third TAN format, TAN-A-tok. This particular kind of alignment file allows you to state precisely which words in one text correspond with the words in another. Because of this precision, such files can take more time to create. But before we even start, we need to decide what kind of relationship holds between the two texts. Let us pretend, for the sake of example, that the 1987 version is a direct descendant (and therefore variation) of the 1881 one. So our task is to show exactly what words or phrases in the older version correspond to those of the newer one. We will simplify here, and exclude punctuation (some linguists legitimately treat punctuation as words in their own right). The term word is notoriously difficult to define, so we will call them tokens, to avoid false connotations (hence the name of the format, TAN-A-tok, to refer to alignment of tokens).

We now create a TAN-A-tok file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-A-tok.rnc" 
    type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN.sch" 
    type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-A-tok xmlns="tag:textalign.net,2015:ns" 
    id="tag:parkj@textalign.net,2015:TAN-A-tok,ring01+ring02">
    <head>
    . . . . . . .
    </head>
    <body reuse-type="general_adaptation" bitext-relation="B-descends-from-A">
        <!-- Examples of picking tokens by number -->
        <align>
            <tok src="ring1881" ref="1" pos="1"/>
            <tok src="ring1987" ref="1" pos="1"/>
        </align>
        <align>
            <tok src="ring1881" ref="1" pos="2"/>
            <tok src="ring1987" ref="1" pos="2"/>
        </align>
        <align>
            <tok src="ring1881" ref="1" pos="3"/>
            <tok src="ring1987" ref="1" pos="3"/>
        </align>
        <align>
            <tok src="ring1881" ref="1" pos="4"/>
            <tok src="ring1987" ref="1" pos="4"/>
        </align>
        <align>
            <tok src="ring1881" ref="1" pos="5"/>
            <tok src="ring1987" ref="1" pos="5"/>
        </align>
        <!-- Examples of picking tokens by value -->
        <align>
            <tok src="ring1881" ref="2" val="A"/>
            <tok src="ring1987" ref="2" val="A"/>
        </align>
        <align>
            <tok src="ring1881" ref="2" val="pocket"/>
            <tok src="ring1987" ref="2" val="pocket"/>
        </align>
        <align>
            <tok src="ring1881" ref="2" val="full"/>
            <tok src="ring1987" ref="2" val="full"/>
        </align>
        <align>
            <tok src="ring1881" ref="2" val="of"/>
            <tok src="ring1987" ref="2" val="of"/>
        </align>
        <align>
            <tok src="ring1881" ref="2" val="posies"/>
            <tok src="ring1987" ref="2" val="posies"/>
        </align>
        <!-- Examples of picking ranges of tokens -->
        <align>
            <tok src="ring1881" ref="3" pos="1, 2"/>
            <tok src="ring1987" ref="3" pos="1"/>
        </align>
        <align>
            <tok src="ring1881" ref="3" pos="3 - 4"/>
            <tok src="ring1987" ref="3" pos="2"/>
        </align>
        <align>
            <tok src="ring1881" ref="4" pos="1"/>
            <tok src="ring1987" ref="4" pos="1"/>
        </align>
        <align>
            <tok src="ring1881" ref="4" pos="2"/>
        </align>
        <align>
            <tok src="ring1881" ref="4" pos="3"/>
            <tok src="ring1987" ref="4" pos="2"/>
        </align>
        <!-- examples of using "last" -->
        <align>
            <tok src="ring1881" ref="4" pos="last-1"/>
            <tok src="ring1987" ref="4" pos="last-1"/>
        </align>
        <align>
            <tok src="ring1881" ref="4" pos="last"/>
            <tok src="ring1987" ref="4" pos="last"/>
        </align>
    </body>
</TAN-A-tok>

Once again, the first four lines, the prolog and root element, should look familiar, with the only significant changes being the names of the validation files, the name of the root element (<TAN-A-tok>), and the value of @id.

The heart of the data is <body>, which has two key attributes, @reuse-type, which describes the activity that was performed to change one version into the other, and @bitext-relation, which specifies how one book relates to the other. Our two values, general_adaptation and B-descends-from-A, are arbitrary names that we define in the <head> (discussed later). (To understand the concepts behind reuse types and bitext relations, see the section called “Token-based annotations and alignments (<TAN-A-tok>)”).

You will also notice some lines that begin with <!-- and end with -->. These are comments, which can be placed within or beside any element, and can enclose any text we like, including line breaks. You may put a comment anywhere you like, as long as it is not inside a tag or attribute.
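The placement of comments can be sketched with a fragment of our first transcription. The wording of the comments is, of course, made up for the example.

```xml
<!-- A comment may sit beside an element, as this one does before body. -->
<body xml:lang="eng">
    <!-- A comment may also sit within an element,
         and it may run across several lines. -->
    <div type="line" n="1">Ring-a-ring-a-roses,</div>
    <div type="line" n="2">A pocket full of posies;</div>
</body>
```

What would not work is a comment placed between the angle brackets of a tag, for instance between div and its type attribute, or inside the quotation marks of an attribute value.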

<body> is the parent of one or more <align> elements, each of which correlates a set of tokens in each of the two texts, pointed to by its <tok> children. Each <tok> has, in this example, three attributes. @src takes a nickname (an @id reference) that points to one of the two transcriptions; we have used ring1881 and ring1987 for our two texts, but we could have just as easily used anything else such as a and b, or uk and us. @ref has a value that points to a specific <div> in the source TAN-T transcription; and @pos or @val specifies which token is intended, either by word number (@pos) or by the text of the actual word (@val). Either technique is fine, and @pos and @val can be mixed, as in the example. It is generally a good idea to use @val, because if you fix a typo, changing the number of tokens in the underlying transcription, @val might not be affected, whereas a reference that relies on @pos alone may then point to the wrong token. You may also notice that the comma and hyphen can be used in @pos to point to multiple words within the same <div>, and that last and last-X (where X is a digit) can be used to point to a token by position counting from the end of a <div>.
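To make the choice between @pos and @val concrete, the sketch below points to the word posies in line 2 of each version in two different ways. It assumes, as this chapter does, that punctuation is excluded, so that line 2 has five tokens.

```xml
<!-- In line 2 of the 1881 version, "A pocket full of posies;", each of
     the following would pick out the same token, posies:
        pos="5"    pos="last"    val="posies"   -->
<align>
    <!-- By position, counting from the end of the div -->
    <tok src="ring1881" ref="2" pos="last"/>
    <!-- By the text of the token itself -->
    <tok src="ring1987" ref="2" val="posies"/>
</align>
```

If a word were later inserted at the start of line 2, pos="5" would no longer point to posies, while val="posies" and, in this case, pos="last" would still do so.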

Each <align> can establish one-to-one, one-to-many, many-to-one, or many-to-many relationships between tokens from the two texts. A token may feature in multiple <align> elements. And if an <align> has <tok> elements belonging to only one source, such as in the fourth-to-last <align> above, we have what is called, in these guidelines, a one-sided alignment. This one-sided alignment indicates that the second word of line four of the 1881 version is excluded from the act that we have called adaptation. If this were a translation, it would be as if we were saying that this word was excluded from the translation. (A one-sided alignment containing tokens only of the later source might point to words that the translator added, i.e., what in translation studies is called explicitation.)
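A one-sided alignment for the later source might look like the following sketch. It is hypothetical: it supposes that we had judged the word the in line 1 of the 1987 version to be an addition, instead of aligning it with the second a of the 1881 version as the file above does.

```xml
<!-- Hypothetical alternative to the fourth align in the file above -->
<align>
    <!-- Only the later source is mentioned, so the alignment is one-sided -->
    <tok src="ring1987" ref="1" pos="4"/>
</align>
```

Read this way, the <align> would claim that the fourth token of line 1 in the 1987 version has no counterpart in the 1881 version.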

If in our TAN-A-tok file we say nothing about a particular word in one of the sources, that silence should not be interpreted to mean that it has no counterpart in the other source. As creators of this file, we make no claim to providing an exhaustive account, and we are under no obligation to indicate every word-for-word correspondence. If we fail to mention certain words, all that can be implied is that we opted not to say anything about them.

We could have aligned the two texts in different ways. Perhaps further study will reveal that we were in error to associate the second "ring" with "round" in line 1. We can make corrections, even after publication, and notify other users of our data about the change. There are also ways to express doubt or alternative opinions, and to credit (or blame) the person making the assertion. We can even correlate fragments of tokens (letters, prefixes, infixes, or suffixes). All these more advanced uses are discussed at the section called “Token-based annotations and alignments (<TAN-A-tok>)”.

^[1]In the TAN system, [Definition: a transcription is a plain digital text that replicates a text found somewhere else, usually reproducing its script and spelling]. The following—"In pluribus unum"—is a (partial) transcription of a United States dollar. The term should be distinguished from [Definition: a transliteration, which is a transcription rendered in a script other than the original]. For example, εν πλουριμπυς ουνεμ, would be a Greek transliteration of the previous transcription.

^[2]Although the TAN examples below look much like files in the examples subdirectory of the TAN library, they have been adjusted, to explain the formats better.

^[3]Software suitable for your needs comes in many styles and prices. In addition to the links in the paragraph above, you may wish to visit the comparative lists published on Wikipedia for both text editors and XML editors. TAN was developed using Oxygen, which is very powerful. If you are a new user, you are likely to find it overwhelming. Take advantage of tutorials and documentation associated with the XML editor you have chosen.
