Chapter 5. Class-1 TAN Files, Representations of Textual Objects (Scripta)

Table of Contents

Principles and Assumptions
Domain model
One version, one work, one object, one reference system
Normalizing transcriptions
Flattened References, and the Leaf Div Uniqueness Rule
Transcriptions Using the Text Encoding Initiative (<TEI>)

This chapter provides general background to class 1 TAN files. For detailed discussion of specific elements or attributes see Chapter 8, TAN patterns, elements, and attributes defined.

Class 1 TAN files preserve segmented transcriptions of books, manuscripts, papyri, stones, or any other objects with writing on them—collectively termed here scripta (sg. scriptum). Files of this class are the foundation of any project. No class 2 files (e.g., alignment, morphology) can be created without class 1 files.

Transcriptions come in two different formats, identified by the root element. <TAN-T> is a simple, generic format, as close as one can get to plain text. <TEI> (also referred to in this manual as TAN-TEI), on the other hand, can be complex and highly expressive. Because the two types function almost identically, the generic TAN-T format is described first, followed by supplemental comments on TAN-TEI.

(For more general principles and assumptions applying to all TAN files, not just class 1, see the section called “Design Principles”.)

Class 1 formats are designed for faithful but judiciously normalized digital transcriptions. Each TAN-T(EI) file is devoted exclusively to a single version of a single work found in a single scriptum (text-bearing object), segmented and uniquely labeled with a common reference system. Editors of TAN-T(EI) files should be able to read, write, and proofread texts in the languages of the transcriptions. They should understand the texts well enough to segment them and label them according to the conventions used for those works. They should be able to distinguish the text of a primary source from its editorial apparatus. They should be familiar with normalizing conventions for texts from the period, language, and culture. They should know how the transcription might be used in other contexts, especially translation studies or a study of quotations.

Editors need not understand everything about their texts, and they need not have any specialized skill in grammar or lexicography. They need not know the morphology of individual words, or how individual parts of the text have been translated. Those skills should be used in other TAN formats.

TAN-T(EI) editors stand at the beginning of a larger workflow for text alignment. It is critical that work be published only after careful proofreading, never hastily. Many transcriptions, especially those of long texts, have typographical errors. Eliminating as many as possible before publication will maximize the utility of a TAN-T(EI) file. On the other hand, TAN has been designed with the assumption that all our files have typographical errors that can and should be corrected as they are found.

If you are creating a TAN-T(EI) file, you are doing so primarily to facilitate alignment and annotation, which depends critically upon a stable, familiar reference system. Transcription files should be segmented and labeled according to a reference system that can be easily applied to other versions of the same text in other languages. If possible, semantic mileposts (clauses, sentences, paragraphs, chapters) should be prioritized over visual (lines, columns, pages, volumes). See below on reference systems.

Contributors and users of TAN files should strongly distinguish between a scriptum (text-bearing object) and a conceptual work, e.g., a specific printed copy of the Iliad versus the Iliad conceived generally. The former has materiality (digital files are treated as being material) and the latter does not. Even though both are constitutively necessary for any transcription, the two are sharply differentiated in the TAN format: <source> and @src point to physical exemplars; <work> and @work to the conceptual.

The distinction may remind some readers of the domain model defined by the Functional Requirements for Bibliographic Records (FRBR), which identifies four types of entities for what it calls Group 1 (Products of intellectual & artistic endeavor): Work, Expression, Manifestation, and Item, the first pair being conceptual, non-material entities and the latter pair material ones.

TAN has been designed with a slightly different domain model in mind. FRBR Items are equivalent to what TAN calls scripta. Multiple scripta that for all intents and purposes are indistinguishable (i.e., items reproduced mechanically) are equivalent to FRBR Manifestations, but in TAN no corresponding entity has been defined. It is best to think of TAN scripta as being equivalent to FRBR Items, with FRBR Manifestations being sets of indistinguishable TAN scripta.

As for conceptual entities, TAN has been designed with the assumption that most users will find the distinction between Works and Expressions to be unhelpful or misleading. What one person calls a FRBR Expression another may legitimately call a Work. TAN assumes that any derivation of a Work (or Works) is itself a Work, which is really shorthand for work-version. Thus, in this manual the term version indicates merely a type of work that is known either to derive from another work or to be the basis for other versions of a work.

TAN avoids altogether the term Expression. Aside from the issues mentioned above, the term implies a medium (without which nothing can be expressed) and therefore materiality.

Every TAN-T(EI) file must be restricted to a transcription of a single version of a single conceptual work found on a single scriptum, segmented and labeled according to a single reference system.

This restrictive principle is critical to the success of the network. It reduces the risk of confusion, simplifies the files, and shifts markup complexity from an individual transcription file to the network in which that file participates.

Each TAN-T(EI) file transcribes one and only one text-bearing object or scriptum. It may be a digital file, a book, a manuscript, a stone, a sign, or a bottlecap. If the object you've chosen has been made mechanically and is virtually indistinguishable from other objects created by the same process (e.g., copies of a printed book or copies of a digital file), then the entire set of copies is to be treated as a single object (an entity some librarians call a manifestation).

The definition of some scripta requires an editor's discernment and judgment. For example, some manuscripts have been split up, their parts now residing in multiple libraries around the world; others may be a composite of older manuscripts. In such cases, you may need to define your scriptum in a way that might not match the way others define it. But the decision is your prerogative, not theirs. You have both the right and responsibility to define your object in the way that you think will most benefit users of your files.

It is a good idea to name your scriptum in <source> with an <IRI> value in the form of an http URL provided by a library catalogue. This way you provide a way for others, perhaps through an algorithm, to retrieve extensive, structured bibliographical information. You also save yourself the hassle of writing a detailed bibliographical description that your users would probably not be able to import into their reference management software. If a URL cannot be found for <IRI>, you may simply coin a tag URN or a UUID. Alternatively, if you find another TAN file that uses the same source, it would be a good idea to adopt that name.
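Coining a name when no catalogue URL is available can be sketched in Python as follows. This is only an illustration: the authority `example.org`, the date, and the specific string are placeholders, and the helper names are hypothetical. The tag URN layout (`tag:` authority `,` date `:` specific) follows RFC 4151; the UUID form relies on Python's standard `uuid` module.

```python
import uuid


def mint_uuid_urn() -> str:
    """Return a random (version 4) UUID written as a URN (urn:uuid:...)."""
    return uuid.uuid4().urn


def mint_tag_urn(authority: str, date: str, specific: str) -> str:
    """Build a tag URN of the form tag:<authority>,<date>:<specific>.

    The authority should be a domain name or email address that you
    controlled on the given date (YYYY, YYYY-MM, or YYYY-MM-DD).
    """
    return f"tag:{authority},{date}:{specific}"


# Placeholder values, for illustration only.
source_name = mint_tag_urn("example.org", "2015", "scriptum:my-manuscript")
fallback_name = mint_uuid_urn()
```

A tag URN has the advantage of being readable and of signalling who coined it; a UUID needs no namespace at all but says nothing to a human reader.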

The transcription must be restricted to a single creative work, identified by <work>.

Many scripta have more than one work. Identifying and defining the creative work you transcribe is, once again, your prerogative. Suppose the scriptum you have is a Bible. The work you choose from that object can take whatever contours you wish. Perhaps you wish to encode the entire Bible and treat it as a single work. Or maybe you wish to treat only the New Testament as the work, or the Tetraevangelion, or the Gospel of Matthew, or a specific episode in that gospel, or simply the Beatitudes. Any definition of a work is permitted, but a TAN-T(EI) file should contain nothing but the work you have defined. It should be a complete representation of what is found on the object, even if only partially preserved, and respect as far as is practical the order of the text in the scriptum.

Well-known works may have a suitable IRI name already assigned to them, say by means of a DBPedia entry. Most works have not been assigned IRIs or are named in IRI vocabularies that are not well known. You may assign any work your own URN, through a UUID or a tag URN.

The transcription must be restricted to a single version of the creative work, identified by <version> (optional). In most cases, <version> is unnecessary, because <work> in conjunction with <source> is sufficient to identify a particular work-version. But if the source carries multiple versions (e.g., a bilingual edition of a text), then <version> should be included.

Each version from a scriptum should have its own separate TAN-T(EI) file.

Notes should be included only if they are an integral part of the primary work (i.e., by the same author, not by a later editor). If you think the notes to a work are important, consider putting them in their own TAN-T(EI) file, or converting them to claims in a TAN-A-div file.

If you need to specify exactly where on a scriptum a version appears, <desc> or <comment> should be used.

Very few work-versions have their own URN names. It is advisable to assign a tag URN or a UUID. If the IRI you have used for <work> is in a namespace that you own or control, then you are entitled to modify it, and you may wish merely to add a suffix to the work IRI to name the version.

Every TAN transcription must be segmented into a hierarchy of uniquely labeled divisions, defined in the <body> through <div>s and their @type and @n values.
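To make the idea of a hierarchy of uniquely labeled divisions concrete, the following Python sketch walks a simplified <body> and builds one reference per leaf <div> from the @n values of the leaf and its ancestors, then reports any reference that occurs more than once. The sample fragment is an assumption made for illustration: it omits the namespace and head that a real TAN-T file carries, the division types `ch` and `v` are invented, and joining @n values with a space is merely one possible convention.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List, Tuple

# Simplified, namespace-free fragment (hypothetical content).
SAMPLE = """
<body>
  <div type="ch" n="1">
    <div type="v" n="1">In the beginning</div>
    <div type="v" n="2">And the earth</div>
  </div>
  <div type="ch" n="2">
    <div type="v" n="1">Thus the heavens</div>
  </div>
</body>
"""


def leaf_references(body: ET.Element, sep: str = " ") -> List[Tuple[str, str]]:
    """Return (reference, text) for every leaf <div> in document order."""
    found: List[Tuple[str, str]] = []

    def walk(element: ET.Element, path: List[str]) -> None:
        for child in element.findall("div"):
            ref = path + [child.get("n", "")]
            if child.find("div") is None:  # a leaf: no nested <div>
                text = " ".join("".join(child.itertext()).split())
                found.append((sep.join(ref), text))
            else:
                walk(child, ref)

    walk(body, [])
    return found


def duplicate_references(refs: List[Tuple[str, str]]) -> List[str]:
    """List every reference that labels more than one leaf <div>."""
    counts = Counter(ref for ref, _ in refs)
    return sorted(ref for ref, count in counts.items() if count > 1)


refs = leaf_references(ET.fromstring(SAMPLE))
duplicates = duplicate_references(refs)
```

Here `refs` pairs the references `1 1`, `1 2`, and `2 1` with their texts, and `duplicates` is empty because no two leaves share a reference.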

Those divisions, whenever possible, should align with the reference system that prevails for the work across versions or translations, what is sometimes called a canonical reference system. Because even the most familiar reference system admits degrees and dispute, the term canonical is problematic, so reference system is preferred in these guidelines.

If you have your choice, preference should be given to systems that follow the semantic contours of the work, not the physical features of a particular object. Chapter, paragraph, and sentence numbers are preferable to volume, page, and line numbers, because other derivative versions of a work (e.g., translations, paraphrases) will only roughly, if at all, follow an object-oriented reference system.

Sometimes an object-based reference system is inescapable, or is the most common reference system for a work (e.g., Porphyry's commentary on the Categories). It is perfectly acceptable to adopt that scheme, but it may eventually entail more labor for the alignment process.

If a given work has multiple systems (e.g., the works of Plato and Aristotle, which have two reference systems—semantic- and object-oriented—both of which are standard and important), then the recommended practice is to encode the same text twice, placing in each file a <see-also> pointing to the other and a <relationship> with the keyword alternatively divided edition as the value of @which. A pair of alternatively divided editions can usefully serve as the basis for concordances. In fact, the pair can be used as the first step in converting other versions of the work from one reference system to the other.

If there is a good reference system, but the divisions are overly lengthy, you may introduce subdivisions. Such subdivided texts are compatible with references to the older system. But there is no guarantee that the provisional subdivisions you introduce will be adopted by other editors who create or edit TAN versions of the same work, and in the end editors working independently upon the same text may produce discordant schemes. The TAN-A-div format was designed to reconcile such differences.

If there is no reference system, or if you think that the ones that exist are inadequate or misguided, create one of your own. If you develop your own reference system, be sure to optimize for all versions of the work, whether known or not.

In the <definitions>, at least one <div-type> must be supplied, declaring the types of divisions into which the text has been segmented, to be referred to by @type in each <div>. To declare a <div-type> does not require you to use it in the transcription. It is advisable to keep the abbreviation you adopt in @xml:id brief but meaningful.

Well-known division types already have suitable IRI names. See the section called “TAN keywords for types of divisions (<div-type>)” for a list of core TAN vocabulary for division types, both common and uncommon. If you encounter a rare division type, or one that needs custom specificity, you should mint your own, either in the declarations or in a separate TAN-key file.

Reference systems have as a central component numbering systems. TAN supports five major numeration systems:

  1. Arabic numerals. 1, 2, 3, etc.

  2. Roman numerals. Values up to 5000, utilizing i, v, x, l, c, d, and m, uppercase or lowercase, with liberal syntactic rules (within a roman numeral, any digit preceding one of a higher value is assumed to be a subtraction from the total value; all others are positive values).

  3. Alphabetic sequences. The 26-letter Roman alphabet, with numbers higher than 26 (or any multiple of 26) beginning with the letter a incrementally repeated, e.g., y (25), z (26), aa (27), bb (28), … aaa (53). Uppercase or lowercase allowed.

  4. Arabic numerals + alphabetic sequences. Arabic numerals followed immediately by an alphabetic sequence. The second item is to be calculated as a subsequence of the first item, with the lack of a second item taking highest priority. E.g., 4, 4a, 4b, 4c....

  5. Alphabetic sequences + Arabic numerals: As above, but with alphabetic sequence preceding Arabic numerals.
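The rules for the numeration systems above can be sketched in Python. This is not the code TAN processors use; it is a minimal illustration under stated assumptions: "preceding" is read as "immediately preceding", the ceiling of 5000 on Roman numerals is not enforced, and the function names are hypothetical.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral: str) -> int:
    """Liberal reading: a digit immediately followed by a digit of higher
    value is subtracted; every other digit is added."""
    letters = numeral.lower()
    if not letters or any(ch not in ROMAN_VALUES for ch in letters):
        raise ValueError(f"not a Roman numeral: {numeral!r}")
    values = [ROMAN_VALUES[ch] for ch in letters]
    total = 0
    for position, value in enumerate(values):
        following = values[position + 1] if position + 1 < len(values) else 0
        total += -value if value < following else value
    return total


def alphabetic_to_int(sequence: str) -> int:
    """a=1 ... z=26, aa=27, bb=28 ... zz=52, aaa=53."""
    letters = sequence.lower()
    if not letters or len(set(letters)) != 1 or not "a" <= letters[0] <= "z":
        raise ValueError(f"not an alphabetic sequence: {sequence!r}")
    return (len(letters) - 1) * 26 + (ord(letters[0]) - ord("a") + 1)


def arabic_alpha_key(label: str) -> tuple:
    """Sort key for labels such as 4, 4a, 4b: the bare numeral sorts first."""
    match = re.fullmatch(r"([0-9]+)([A-Za-z]*)", label)
    if match is None:
        raise ValueError(f"not an Arabic + alphabetic label: {label!r}")
    number, suffix = match.groups()
    return (int(number), alphabetic_to_int(suffix) if suffix else 0)


ordered = sorted(["4b", "4", "4a", "3c"], key=arabic_alpha_key)
```

Treating a missing suffix as zero is one simple way to give the bare numeral priority over its lettered subdivisions.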

TAN file processors will attempt to convert all values of @n to Arabic numerals. Some values are ambiguously Roman numerals or alphabetic sequences, e.g., c (= 3 or 100). Such numerals are assumed to be Roman, unless you supply an <ambiguous-letter-numerals-are-roman> and define it as false.
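How an ambiguous value of @n might be resolved can be sketched as follows. The parameter `ambiguous_are_roman` is a hypothetical stand-in for the setting carried by <ambiguous-letter-numerals-are-roman>; the function itself is an illustration, not the conversion routine of any actual TAN processor, and it covers only Arabic numerals, Roman numerals, and alphabetic sequences.

```python
import re

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def n_to_int(n: str, ambiguous_are_roman: bool = True) -> int:
    """Convert a value of @n to an integer, resolving letter ambiguity."""
    value = n.strip().lower()
    if re.fullmatch(r"[0-9]+", value):
        return int(value)
    is_roman = bool(value) and all(ch in ROMAN for ch in value)
    # An alphabetic sequence is one letter repeated one or more times.
    is_alphabetic = re.fullmatch(r"([a-z])\1*", value) is not None
    if is_roman and (ambiguous_are_roman or not is_alphabetic):
        digits = [ROMAN[ch] for ch in value]
        # Subtract a digit when the next digit has a higher value.
        return sum(-d if d < nxt else d for d, nxt in zip(digits, digits[1:] + [0]))
    if is_alphabetic:
        return (len(value) - 1) * 26 + ord(value[0]) - ord("a") + 1
    raise ValueError(f"cannot interpret @n value: {n!r}")


as_roman = n_to_int("c")              # read as a Roman numeral
as_letter = n_to_int("c", False)      # read as the third letter
```

Only values that fit both systems are affected by the setting: `xiv` is read as a Roman numeral either way, and `b` as a letter either way.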

There are also tools for other numeration systems, but they have not been implemented in the validation process. See tan:letter-to-number() and dependencies.

You should declare how you have normalized the transcription via <alter> and its children, e.g., <normalization>. (For suggestions on values of <IRI> for <normalization> see the section called “TAN keywords for types of normalizations (<normalization>)”.)

Generally speaking, normalization entails the suppression of things extraneous to or separable from the work you have chosen. You are encouraged to omit parenthetical editorial insertions (especially quotation references), stray handwritten remarks, discretionary word-breaking hyphens, editorial comments, inserted cross-references, and reference numerals (page numbers, section numbers, etc.). If chapter 4 begins "4." or "IV" then leave out the prefatory numeral—you've already indicated it in @n. In addition, you should resolve ligatures and correct unintended typographical errors. (Such orthographic corrections are useful to those users who want to generate lexico-morphological data automatically or semiautomatically.)

The goal is a transcription whose text is free of the interpretive voice of later editors. You should remove from the text anything that is not part of the work proper and would interfere with detailed word-for-word alignment, or would require extra preprocessing or postprocessing work for later users. If you are segmenting a source into line breaks, and you are required to break a word between divisions, you should use either the soft hyphen (&#xad;) or the zero-width joiner (&#x200d;) at the end of the first leaf <div>. TAN processors that handle a leaf <div> will automatically normalize the space in the element, then place a space between that leaf <div> and the next, unless one of those two characters is found at the end of the first, in which case the character will be deleted and the two <div>s will be joined with no intervening space. For more on issues regarding whitespace, see the section called “White space”.
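The joining behavior just described can be sketched in Python. The function names are hypothetical and the sketch is not drawn from a TAN processor; it assumes that space normalization means collapsing runs of XML whitespace (space, tab, carriage return, line feed) to a single space and trimming both ends.

```python
import re

SOFT_HYPHEN = "\u00ad"
ZERO_WIDTH_JOINER = "\u200d"


def normalize_space(text: str) -> str:
    """Collapse runs of XML whitespace to one space and trim both ends."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" \t\r\n")


def join_leaf_divs(leaf_texts):
    """Concatenate the texts of consecutive leaf divs.

    A leaf ending in a soft hyphen or zero-width joiner loses that
    character and is joined directly to the next leaf; every other
    leaf is followed by a single space.
    """
    pieces = []
    for text in leaf_texts:
        clean = normalize_space(text)
        if clean.endswith((SOFT_HYPHEN, ZERO_WIDTH_JOINER)):
            pieces.append(clean[:-1])   # drop the joining character
        else:
            pieces.append(clean + " ")
    return "".join(pieces).rstrip(" ")


joined = join_leaf_divs(["  The  quick bro\u00ad", "wn fox ", "jumps"])
```

In this example the word broken across the first two leaves is restored, so `joined` reads "The quick brown fox jumps".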

In a digital source, variable lengths of spacing marks (e.g., General Punctuation U+2000..U+200B) should be converted to ordinary spaces, and superscript combining Roman letters (U+0363..U+036F) should probably be converted to their non-combining counterparts. All Unicode must be normalized to NFC forms (see the section called “Normalization”).
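These conversions can be sketched with Python's standard `unicodedata` module. The function name and the `resolve_combining_letters` switch are hypothetical; the switch exists because the advice on combining letters is only tentative. The letter for each combining character is read from its official Unicode name (for example, COMBINING LATIN SMALL LETTER A yields a).

```python
import re
import unicodedata


def normalize_transcription(text: str, resolve_combining_letters: bool = True) -> str:
    """Apply the spacing, combining-letter, and NFC conversions."""
    # 1. Variable-width spaces U+2000..U+200B become ordinary spaces.
    text = re.sub("[\u2000-\u200b]", " ", text)
    # 2. Optionally, combining letters U+0363..U+036F become plain letters,
    #    taken from the last word of each character's Unicode name.
    if resolve_combining_letters:
        text = re.sub(
            "[\u0363-\u036f]",
            lambda match: unicodedata.name(match.group())[-1].lower(),
            text,
        )
    # 3. Everything is normalized to NFC.
    return unicodedata.normalize("NFC", text)


spaced = normalize_transcription("a\u2003b")      # em space between a and b
composed = normalize_transcription("e\u0301")     # e + combining acute accent
resolved = normalize_transcription("u\u0364")     # u + combining small letter e
```

Whatever conversions you choose to apply, they should be declared in the file so that users know how the transcription departs from its source.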

If you are working with a text with notes, distinguish those written by the same person who wrote the work you're transcribing from those that aren't. Treat the former as part of the work proper and give each note a <div> with a suitable @type and place it after the <div> it annotates. It will be assumed by processors of the data that, absent more specific information, any <div> of an annotating @type is an annotation of the last <div> that is not an annotation. (Alternatively, you may use the <note> feature of TAN-TEI, but bear in mind that this element will be treated by users as part of the leaf div to which it belongs, not separate from it.)
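The default assumption about annotating divisions can be illustrated with a short Python sketch. Divisions are represented here as (type, n) pairs in document order; the type names `p` and `note` and the function name are invented for the example, and a real processor may well represent divisions differently.

```python
def attach_annotations(divs, annotating_types):
    """Pair each annotating div with the last preceding div that is not
    itself an annotation (None if no such div precedes it)."""
    pairs = []
    last_main = None
    for div_type, n in divs:
        if div_type in annotating_types:
            pairs.append(((div_type, n), last_main))
        else:
            last_main = (div_type, n)
    return pairs


# Hypothetical sequence: two notes on paragraph 1, one on paragraph 2.
divs = [("p", "1"), ("note", "a"), ("note", "b"), ("p", "2"), ("note", "c")]
pairs = attach_annotations(divs, {"note"})
```

Notes a and b are both attached to paragraph 1, since a note never counts as the target of a later note; note c is attached to paragraph 2.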

If the notes are not part of the work per se—for example, translator's notes in a translation of a primary source—you should treat them as a separate work altogether, and put them in a separate TAN-T(EI) file, perhaps linking the two through <see-also>. You may wish to structure that file so that it mirrors the reference system of the primary source, to facilitate automatic alignment between the two.

Remember that the note signals in the main text and in the footnote area are metadata meant to help readers link corresponding passages of texts, and should be deleted. If the connective function served by the note signal is important, create <claim>s in a TAN-A-div file, which supports correlating comments to specific ranges of text.

This principle holds true for variants in the scriptum. For example, a manuscript may have correctors' marks. Or a set of footnotes (or apparatus criticus) might comment on how and why the main text differs from previous readings. In those cases, each set of corrections might be wholly incorporated into the <claim>s of a TAN-A-div file, perhaps also with a separate TAN-T file.

Overall, normalization is a difficult topic, and it is not well studied. Not all decisions will be clear-cut. You may justly hesitate before normalizing orthography, punctuation, accentuation, or capitalization. Some aspects of Unicode that lend themselves to varying conventions may need special consideration. You may need to consider whether an unusual or rarely used Unicode character might be misinterpreted or hinder other users. Document any decisions in the <alter>.

In some ambiguous areas, you can use TAN-TEI to your advantage. Suppose, for example, a manuscript has reference numerals that are sui generis. That is, these reference numbers do not correspond to the "canonical" reference scheme. On the one hand, they are metadata, and should arguably be deleted; on the other, they are part of the text, and witness to how a text was read and changed over time. A middle-ground approach would move these references to TAN-TEI's <milestone rend="">. In that way, the numerals are removed from the main text, yet the information is retained. Generally speaking, TEI's @rend is an excellent way to remove something from the main text without removing it from the file altogether.