Division-Based Alignments (<TAN-A-div>)

TAN-A-div is the format for macroscopic, division-based alignment, and is dedicated to aligning any number of versions of any number of works on the basis of <div>s, or even smaller, ad hoc segments in the sources invoked.

A TAN-A-div file provides two major services.

Reconciling structural differences between versions of the same text. Some independently created transcriptions of the same work will, no matter the good intentions of the transcribers, fail to correspond exactly to related versions. Perhaps works or div types were not defined with the same IRIs, or perhaps one version follows a reference system at odds with the majority of other versions. Perhaps a version is interpolated or lacunose. TAN-A-div is used to reconcile such inconsistencies, to make special alignments that a computer might not be able to make accurately, and to refine the alignment of parallel sources, even down to the word level.

Making general claims about a work, or a particular version of a work. Scholars working with texts regularly wish to make claims about those texts, e.g., work A passage b quotes from work X passage c; work A passage b deals with topic M; work A passage b word 7 has a variant reading b' in version A1.

For the first purpose, the motivations of an aligner are opaque. A TAN-A-div file says, in essence, "Please align the following sources," but it does not say why the alignment is requested, and it does not indicate what relationship holds between the various sources. In fact, a TAN-A-div file could be used to align texts that have no apparent relationship (to what end would be unclear).

For the second purpose, the aligner makes claims about the texts, and motivations and assumptions are made as clear as possible.

Processors of a TAN-A-div file will assume greedy alignment. Alignments will be inferred wherever possible, when not explicitly overridden. Alignments are also transitive. If passage A is declared to align with B, then, barring any exceptions, anything that aligns with A will be assumed to align with anything that aligns with B (see the section called “Interpretation of multiple values”).

The root element of a TAN division-based alignment file is <TAN-A-div>.

TAN-A-div's <head> has some special rules.

One or more <source>s must be declared (see the section called “Distinguishing <source>s and <see-also>s”). That an alignment file would have only a single source may seem strange, but such a scenario could be useful for self-alignment (i.e., to indicate places where a source reuses itself), or to make claims about that text.

<declarations> takes zero or more of the declarations common to class 2 files: <token-definition>, <suppress-div-types>, <rename-div-ns>. See the section called “Common Elements”. TAN-A-div also allows the declarations covered in the section called “~TAN-c-decl-core”.

A TAN-A-div may have an empty <body> because the format by default demands greedy alignment. That is, it effectively states, "Take the list of sources in the header. First group (align) them by work, then by <div>s according to flat refs."

A processor will create groups of works according to the <IRI> values under <work> in each source. To those matches will be added any sources you declare to contain the same work. Then, within each group of versions of the same work, the processor will align (group) <div>s based on their flat ref (based on @n), after normalization and after taking into account exceptions declared in the TAN-A-div file.

If sources representing different versions of the same work already have <div>s whose flat refs match well, then nothing needs to be declared in a TAN-A-div <body>. A TAN-conformant processor will perform the alignment.
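A file that relies entirely on this default behavior might look like the following minimal sketch. The source ids are hypothetical, and the namespace declaration, the other required header elements, and the children of each <source> are omitted for brevity; only the elements named in this section are shown.

```xml
<!-- Hypothetical sketch: ids are invented and required header content is omitted -->
<TAN-A-div>
   <head>
      <!-- other header elements omitted -->
      <source xml:id="version-1"><!-- IRI, name, and location of the first version --></source>
      <source xml:id="version-2"><!-- IRI, name, and location of a second version of the same work --></source>
      <declarations/>
   </head>
   <!-- An empty body: the processor groups by work, then aligns divs by flat ref -->
   <body/>
</TAN-A-div>
```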

Within the <body> of a TAN-A-div file, the first optional procedure, reconciliation, is an up-to-four-step process. Each step is optional and sequence-specific. That is, each statement assumes actions specified by previous siblings have already been implemented.

After reconciliation, the second optional procedure, claims, is handled.

In the first step you may declare an ad hoc equivalence between sources that do not already share an <IRI> value for <work>. Each equivalence is made through an <equate-works>, which groups together under @src the ids of sources that should be treated as containing the same work.

Transitive alignment holds: <equate-works work="a b"/> means that any sources that share the same works as a and b will also be treated as equivalent.

This declaration does not imply that the works are, in reality, one and the same. It merely states that, for the purposes of this alignment, they should be treated as equivalent.
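As a rough sketch of this first step, the fragment below treats two sources as containing the same work. The ids "eng-a" and "fra-b" are hypothetical, and the attribute is written as @src following the description of this step; check the schema for the attribute name your version of TAN expects.

```xml
<body>
   <!-- Hypothetical ids: treat sources eng-a and fra-b as versions of one work,
        for the purposes of this alignment only -->
   <equate-works src="eng-a fra-b"/>
</body>
```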

The second step does for div types what the first step did for works, with <equate-div-types>. Across all sources, every <div-type> that shares an <IRI> value will be treated as equivalent. But you may augment that automated alignment through an <equate-div-types>, which takes one or more <div-type-ref>s, each of which takes a mandatory @src and @div-type-ref, to point to one or more sources and division types. You must use the @xml:id assigned by the source to that div type.

As with <equate-works>, <equate-div-types> assumes a greedy, transitive alignment. The ad hoc declaration does not imply that the two types of division are in reality one and the same; it just correlates them for the sake of the alignment.

This step is not likely to be used in most TAN-A-div files, because it has no impact on the steps that follow, or even on alignment proper, since it does not affect the reconciliation of flat refs. It is useful mainly in those cases where you expect users of your file to be interested in comparing division types (e.g., calculating ratios of paragraphs to chapters per version per work).
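A sketch of the second step, assuming two hypothetical sources whose div types carry the @xml:id values "par" and "section" respectively:

```xml
<!-- Hypothetical ids: correlate the "par" div type of eng-a
     with the "section" div type of fra-b -->
<equate-div-types>
   <div-type-ref src="eng-a" div-type-ref="par"/>
   <div-type-ref src="fra-b" div-type-ref="section"/>
</equate-div-types>
```

Because the alignment is greedy and transitive, any other div type already equated with either of these would join the same group.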

Suppose you have two transcriptions where a phrase ending one leaf <div> in source A actually corresponds to the beginning phrase of the next leaf <div> in source B. Or suppose that you wish to break down a leaf <div> into smaller constituent parts, to facilitate more exact alignment against another version that is divided more granularly. Before these refined alignments can occur, you must first segment specific leaf <div>s through <split-leaf-div-at>, which contains one or more <tok>s pointing to individual words (see the section called “@pos and @val”) that should begin a new segment in each reference in each source.

@ref must refer only to leaf <div>s. Any leaf <div> may be split as many times as one wishes, but never at the first token.
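The third step might be sketched as follows. The source ids, references, and token positions are hypothetical, and placing @src directly on each <tok> is an assumption made for illustration; see the section called “@pos and @val” for how tokens are picked.

```xml
<!-- Hypothetical values: each tok names the word that begins a new segment.
     Neither pos is 1, since a leaf div may never be split at its first token. -->
<split-leaf-div-at>
   <tok src="eng-a" ref="1 2" pos="8"/>
   <tok src="fra-b" ref="1 2" pos="5"/>
</split-leaf-div-at>
```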

After step 3, some of the divisions and segments of a work may not be properly aligned. Segments newly created by <split-leaf-div-at>s may need to be realigned. Or perhaps one of the sources uses a reference system that is out of step with the others. <realign> is used to reconcile differences. It is not used for aligning across works.

There are two types of realignment: anchored and unanchored, discussed in detail at <realign>.

At this point, each work should have its versions properly aligned. You are now in a position to indicate other places where one work quotes from another, or make other comments on specific textual passages. In this process, <claim> may be used to indicate such things as:

  • textual passages where one work quotes or alludes to another work or itself (index of quotations and allusions);

  • textual passages that deal with a certain topic (general index);

  • where notes in one source correspond to main text in another (tethering separated notes to main text);

  • alternative readings of a textual passage (apparatus criticus).

These alignments occur through <claim>s whose <subject> or <object> points to passages of text.

Any textual <subject> or <object> may take @work or @src. The former takes a single reference to a <source>, but adopts the reference as a proxy to make a claim applicable to all versions of the same work. @src restricts the claim to specific versions, not to the work as a whole.

<claim> is most commonly used to create an interoperable index, indicating where one work quotes from another. Such claims should not be taken to apply to the whole (see the section called “Interpretation of multiple values”). A claim that passage b quotes passage y means only that some part of b quotes from some part of y, not that the whole of b quotes from the whole of y. Specificity must be made on the level of <tok>, a child of a textual <subject> or <object>.

Furthermore, if that <tok> is governed by @work and not @src, then two statements are implied, first that the claim pertains to such-and-such a particular range of tokens in a particular source, and second that the claim pertains to other versions of the same work, but at unspecified ranges of words. For example:

<claim verb="quotes">
   <subject work="nt-grc">
      <tok ref="Mk 10:6" pos="last-4 - last"/>
   </subject>
   <object work="lxx">
      <tok ref="Gen 1:27" pos="last-4 - last"/>
   </object>
</claim>

might correlate the following leaf divs (the matching words are the last five tokens of each, ἄρσεν καὶ θῆλυ ἐποίησεν αὐτούς):

<div n="27" type="v">καὶ ἐποίησεν ὁ θεὸς τὸν ἄνθρωπον κατ' εἰκόνα 
θεοῦ ἐποίησεν αὐτόν ἄρσεν καὶ θῆλυ ἐποίησεν αὐτούς</div>
. . . . . 
<div type="v" n="6">ἀπὸ δὲ ἀρχῆς κτίσεως ἄρσεν καὶ θῆλυ ἐποίησεν αὐτούς</div>

Even though the claim is about the work in general, the statement provides specificity to only two sources. The claim will be regarded as holding over other versions of the same works, but only on the leaf div level. On the token level, it is up to a processor to determine if and where the relative position of the quote might be found.
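For contrast, here is a sketch of the same claim restricted to specific versions. It reuses the source ids from the example above and swaps @work for @src, so that, per the rule stated earlier, the claim would apply only to those two sources and not to other versions of the same works.

```xml
<!-- Sketch: @src instead of @work limits the claim to these two versions -->
<claim verb="quotes">
   <subject src="nt-grc">
      <tok ref="Mk 10:6" pos="last-4 - last"/>
   </subject>
   <object src="lxx">
      <tok ref="Gen 1:27" pos="last-4 - last"/>
   </object>
</claim>
```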