Division-Based Alignments (<TAN-A-div>)

TAN-A-div is the format for macroscopic, division-based alignment, and is dedicated to aligning any number of versions of any number of works on the basis of <div>s, or even smaller, ad hoc segments in the sources invoked.

A TAN-A-div file provides two major services.

Reconciling structural differences between versions of the same text. Some independently created transcriptions of the same work will, despite the good intentions of the transcribers, fail to correspond exactly to other versions of that work. Perhaps the works or div types will not be defined identically, or perhaps one version follows a reference system at odds with the majority of other versions. Perhaps a version is interpolated or lacunose. TAN-A-div is used to reconcile such inconsistencies, to make special alignments that a computer might not be able to detect automatically, and to refine the alignment of parallel sources, even down to the word level.

Making general claims about a work, or a particular version of a work. Scholars working with texts regularly wish to assert claims about those texts, e.g., work A passage b quotes from work X version Y passage c; work A passage b deals with topic M; work A passage b word 7 has a variant reading in version A1, where it reads "exemplum."

For the first purpose, the motivations of an aligner are opaque. A TAN-A-div file says, in essence, "Please align the following sources," but it does not say why the alignment is requested, and it does not indicate what relationship holds between the various sources. In fact, a TAN-A-div file could be used to align texts that have no apparent relationship (to what end would be unclear).

For the second purpose, the aligner makes claims about the texts, and motivations and assumptions are made as clear as possible.

Processors of a TAN-A-div file will assume greedy alignment. Alignments will be inferred wherever possible, when not explicitly overridden. Alignments are also transitive. If passage A is declared to align with B, then, barring any exceptions, anything that aligns with A will be assumed to align with anything that aligns with B.
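The transitivity rule can be modeled as a small union-find over declared pairs. This is only an illustrative sketch, not how any particular TAN processor works: the function name alignment_groups and the passage labels are hypothetical, and exceptions or overrides are not modeled.

```python
from collections import defaultdict


def alignment_groups(declared_pairs):
    """Group passages into alignment classes, treating alignment as transitive.

    Hypothetical helper: each pair (a, b) stands for "a aligns with b".
    """
    parent = {}

    def find(x):
        # Register unseen passages as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in declared_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two alignment classes

    groups = defaultdict(set)
    for passage in list(parent):
        groups[find(passage)].add(passage)
    return sorted(sorted(group) for group in groups.values())


# C aligns with A and D aligns with B, so declaring A ~ B puts all four together.
pairs = [("A", "B"), ("C", "A"), ("D", "B"), ("E", "F")]
groups = alignment_groups(pairs)
```

Here groups holds two classes: one containing A, B, C, and D, and one containing E and F.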

The root element of a TAN division-based alignment file is <TAN-A-div>.

Under <head>, some special rules apply to TAN-A-div types.

One or more <source>s must be declared (the section called “Distinguishing <source>s and <see-also>s”). That an alignment file would have only a single source may seem strange, but such a scenario could be useful for self-alignment (i.e., to indicate places where a source reuses itself).

<declarations> takes zero or more of the declarations common to class 2 files: <token-definition>, <suppress-div-types>, <rename-div-ns>. See the section called “Common Elements”. TAN-A-div also allows declarations unique to the section called “~TAN-c-decl-core”.

A TAN-A-div may have an empty <body> because the format by default demands greedy alignment. That is, it effectively states, "Take the list of sources in the header. First group (align) them by work, then by <div>s according to flat refs."

A processor will create groups of works according to the <IRI> values under <work> in each source. To those matches will be added any sources that you declare to contain the same work. Then, within each group of versions of the same work, the processor will align (group) <div>s based on their flat refs (derived from @n), after normalization and after taking into account exceptions declared in the TAN-A-div file.

If sources representing different versions of the same work already have <div>s whose flat refs match well, then nothing needs to be declared in a TAN-A-div <body>. A TAN-conformant processor will perform the alignment.
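The default behavior of an empty <body> can be pictured with a rough sketch. Everything here is an assumption made for illustration: the dictionary layout, the single work IRI per source, the example IRIs, and the dotted form of the flat refs, which are taken to be already normalized.

```python
from collections import defaultdict


def group_by_work(sources):
    """Map each work IRI to the ids of the sources that declare it."""
    groups = defaultdict(list)
    for src_id, src in sources.items():
        groups[src["work_iri"]].append(src_id)
    return dict(groups)


def align_by_flat_ref(sources, src_ids):
    """Within one work, collect the div texts that share a flat ref."""
    aligned = defaultdict(dict)
    for src_id in src_ids:
        for flat_ref, text in sources[src_id]["divs"].items():
            aligned[flat_ref][src_id] = text
    return dict(aligned)


# Hypothetical data: two versions of one work and one unrelated source.
W1 = "tag:example.org,2015:work:w1"
W2 = "tag:example.org,2015:work:w2"
sources = {
    "grc": {"work_iri": W1, "divs": {"1.1": "alpha", "1.2": "beta"}},
    "eng": {"work_iri": W1, "divs": {"1.1": "one", "1.3": "three"}},
    "other": {"work_iri": W2, "divs": {"1.1": "x"}},
}

works = group_by_work(sources)
aligned = align_by_flat_ref(sources, works[W1])
```

In this model, div 1.1 of "grc" and "eng" align automatically, while 1.2 and 1.3 each remain without a partner. Mismatches of that kind are what the <body> is meant to reconcile.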

The <body> of a TAN-A-div file, therefore, is used only (1) to reconcile and customize the alignment between source TAN-T(EI) files and (2) to make claims about the texts.

The first procedure is an up-to-four-step process. Each step is optional and sequence-specific. That is, each statement assumes declarations made in previous siblings have already been taken into account. These steps are fundamentally simple, even if the descriptions that follow seem unnecessarily detailed or complex. You are advised to consult examples in the detailed synopsis of the elements mentioned below.

After the steps of the first procedure are handled, the claims in the second procedure are dealt with.

In the first step you may declare an ad hoc equivalence between sources that do not already share an <IRI> value for <work>. Each equivalence is made through an <equate-works>, which groups together under @src the ids of sources that should be treated as containing the same work.

Transitive alignment holds: <equate-works src="a b"/> means that any sources that share a work with a or b will also be treated as equivalent.

This declaration does not imply that the works are, in reality, one and the same. It merely states that, for the purposes of this alignment, they should be treated as equivalent.
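The effect of this step on the automatically formed work groups might be modeled as follows. This is a sketch under stated assumptions: apply_equate_works is a hypothetical name, each equation is represented as the list of source ids an <equate-works> would group together, and ids that belong to no existing group are simply ignored.

```python
def apply_equate_works(work_groups, equations):
    """Merge every work group that an equation touches through one of its source ids."""
    groups = [set(group) for group in work_groups]
    for src_ids in equations:
        wanted = set(src_ids)
        touched = [g for g in groups if g & wanted]
        untouched = [g for g in groups if not (g & wanted)]
        merged = set().union(*touched)  # whole groups are merged, not just the named ids
        groups = untouched + ([merged] if merged else [])
    return sorted(sorted(group) for group in groups)


# Groups formed from shared work IRIs; a2 and b2 were never named in the equation.
iri_groups = [["a", "a2"], ["b", "b2"], ["c"]]
merged_groups = apply_equate_works(iri_groups, [["a", "b"]])
```

Equating a and b pulls a2 and b2 into the same group, while c stays separate.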

The second step does for div types what the first step did for works, with <equate-div-types>. Across all sources, every <div-type> that shares an <IRI> value will be treated as equivalent. But you may augment that automated alignment through an <equate-div-types>, which takes one or more <div-type-ref>s, each of which takes a mandatory @src and @div-type-ref, to point to one or more sources and division types. You must use the @xml:id assigned by the source to that div type.

As with <equate-works>, <equate-div-types> assumes a greedy, transitive alignment. The ad hoc declaration does not imply that the two types of division are, in reality, one and the same; it merely correlates them for the sake of the alignment.

This step is not likely to be used in most TAN-A-div files, because it has no impact on the steps that follow, or even on alignment proper, since it does not affect the reconciliation of flat refs. It is useful mainly in those cases where you expect users of your file to be interested in comparing division types (e.g., calculating ratios of paragraphs to chapters per version per work).

The third step prepares leaf <div>s for finer alignment. Suppose you have two transcriptions where the phrase that ends one leaf <div> in source A corresponds to the phrase that begins the next leaf <div> in source B. Or suppose that you wish to break a leaf <div> into smaller parts, to facilitate more exact alignment against a version that is divided more granularly. Before these refined alignments can occur, you must first segment specific leaf <div>s through <split-leaf-div-at>, which contains one or more <tok>s, each pointing to an individual word (see the section called “@pos and @val”) that should begin a new segment in each reference in each source.

@ref must refer only to leaf <div>s. Any leaf <div> may be split as many times as one wishes, but never at the first token.
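The splitting rule can be sketched in a few lines. The assumptions are loud ones: tokens are obtained by plain whitespace splitting (a real file would follow its <token-definition>), positions are 1-based, and split_leaf_div is a hypothetical helper rather than part of any TAN tool.

```python
def split_leaf_div(text, start_positions):
    """Split a leaf div's text into segments that begin at the given 1-based token positions."""
    tokens = text.split()  # assumption: whitespace tokenization
    positions = sorted(set(start_positions))
    for pos in positions:
        if pos == 1:
            # A split at the first token would create an empty leading segment.
            raise ValueError("a leaf div may not be split at its first token")
        if not 1 < pos <= len(tokens):
            raise ValueError(f"token position {pos} is out of range")
    # Convert the 1-based start positions into slice boundaries.
    bounds = [0] + [pos - 1 for pos in positions] + [len(tokens)]
    return [" ".join(tokens[i:j]) for i, j in zip(bounds, bounds[1:])]


segments = split_leaf_div("In the beginning was the word", [3, 6])
```

With splits at tokens 3 and 6, the six-token div yields three segments: "In the", "beginning was the", and "word".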

After step 3, some of the divisions and segments of a work may not be properly aligned. Segments newly created by <split-leaf-div-at>s may need to be realigned. Or perhaps one of the sources uses a reference system that is out of step with the others. <realign> is used to reconcile differences. It is not used for aligning across works (see step 5, below).

There are two types of realignment: anchored and unanchored, discussed in detail at <realign>.

The structure of <realign> follows that of <align>, described in the next step.

At this point, each work should have its versions properly aligned. You are now in a position to indicate places where one work quotes from another, or to make other comments on specific textual passages. In this process, <claim> may be used to indicate where one work quotes another or itself. It can also be used to indicate that passages deal with a certain topic, which is particularly helpful as a basis for creating a general index. It can even be used to specify where the reading of one source departs from the majority reading (the basis for the apparatus criticus in a critical edition).

These alignments occur through <claim>s that have <subject>, <object>, or both pointing to passages of text.

Any textual <subject> or <object> may take @work or @src. The former takes a single reference to a <source>, but uses it as a proxy to make a claim applicable to all sources of the same work. The latter takes one or more references, and makes the claim for those versions exclusively.
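The difference in scope between the two attributes can be modeled with a short sketch. It assumes that the two attributes are mutually exclusive, that @src holds space-separated source ids, and that work groups have already been computed. The function name sources_for_claim is hypothetical.

```python
def sources_for_claim(work_groups, work=None, src=None):
    """Return the source ids that a textual subject or object applies to."""
    if (work is None) == (src is None):
        raise ValueError("give exactly one of work or src")
    if src is not None:
        # src: the claim holds for the named versions only.
        return sorted(src.split())
    # work: one source acts as a proxy for every source of the same work.
    for group in work_groups:
        if work in group:
            return sorted(group)
    raise KeyError(work)


work_groups = [["a1", "a2", "a3"], ["x1"]]
via_work = sources_for_claim(work_groups, work="a2")
via_src = sources_for_claim(work_groups, src="a1 x1")
```

Pointing at a2 through work reaches a1, a2, and a3, whereas naming "a1 x1" through src reaches exactly those two sources.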

This process is commonly used to indicate where one work quotes from another. If work A passage b is said to quote work X passage y, there is no implication that the entirety of b is a quotation of the entirety of y. Such claims need to be made on the level of <tok>, a child of a textual <subject> or <object>. Furthermore, if that <tok> is governed by @work and not @src, then two statements are implied, first that the claim pertains to such-and-such a particular range of tokens in a particular source, and second that the claim pertains to other versions of the same work, but at unspecified ranges of words. It is up to a processor to use an algorithm to determine where the relative position of a quote in version A1 is to be found in versions A2, A3, and so forth.