Chapter 5. Class-1 TAN Files, Representations of Textual Objects (Scripta)

Table of Contents

Principles and Assumptions
General
Domain model
One version, one work, one object, one reference system
Normalizing transcriptions
Transcriptions
Flattened References, and the Leaf Div Uniqueness Rule
Transcriptions Using the Text Encoding Initiative (<TEI>)

This chapter provides general background to the elements and attributes that are common to all class 1 TAN files. For detailed discussion see Chapter 8, TAN patterns, elements, and attributes defined.

Class 1 TAN files preserve segmented transcriptions of books, manuscripts, papyri, stones, or any other objects with writing on them—collectively termed here scripta (sg. scriptum). Files of this class are the foundation of any project. No class 2 files (e.g., alignment, morphology) can be created without class 1 files.

Transcriptions come in two different formats, identified by the root element. <TAN-T> is a simple, generic format, as close as one can get to plain text. <TEI> (also referred to in this manual as TAN-TEI), on the other hand, can be complex and highly expressive. Because the two types function almost identically, the generic TAN-T format is described first, followed by supplemental comments on TAN-TEI.

(For more general principles and assumptions applying to all TAN files, not just class 1, see the section called “Design Principles”.)

Class 1 formats are designed for faithful but judiciously edited digital transcriptions. Each TAN-T(EI) file is devoted exclusively to a single version of a single work found in a single scriptum (text-bearing object), segmented and uniquely labeled with a common reference system. Editors of TAN-T(EI) files should be able to read, write, and proofread texts in the languages of the transcriptions. They should understand the texts well enough to segment them and label them according to the conventions used for those works. They should be able to distinguish the primary source from its editorial apparatus. They should be familiar with normalizing conventions for texts from the period, language, and culture. They should know how users of the transcription might use it in other contexts, especially translation studies or a study of quotations.

Editors need not understand everything about their texts, and they need not have any specialized skill in grammar or lexicology. They need not know the morphology of individual words, or how individual parts of the text have been translated. Those skills are better used in other TAN formats.

TAN-T(EI) editors stand at the beginning of a larger workflow for text alignment. It is critical that work be published only after careful proofreading, especially of white space, and never hastily. Many transcriptions, especially those of long texts, have typographical errors. Eliminating as many as possible before publication will maximize the utility of a TAN-T(EI) file. On the other hand, TAN has been designed with the assumption that all our files have typographical errors that we need to correct as they are found.

If you are creating a TAN-T(EI) file, you are doing so primarily to service text alignment. To align is to correlate texts that are similar because of copying, translating, paraphrasing, revising, quoting, summarizing, and so forth. In all these processes, one or more texts, usually called the source (or sources), serve as the basis for a new text, oftentimes called the target. In many cases, the target and source bear little resemblance to each other. Therefore the best transcription files are those whose structures look to an archetype, not a particular version. Editors of TAN transcriptions should not worry about preserving the appearance of their source (i.e., a transcription should not be a diplomatic edition), and they should structure the text, when possible, by the most familiar reference system for that work. If possible, semantic mileposts (clauses, sentences, paragraphs, chapters) should be prioritized over visual ones (lines, columns, pages, volumes). See below on reference systems.

Contributors and users of TAN files must assume a firm distinction between a scriptum (text-bearing object) and a conceptual work, e.g., a specific printed copy of the Iliad versus the Iliad conceived generally. The former has materiality (digital files are treated as having materiality) and the latter does not. Even though both are constitutively necessary for any transcription, the two are sharply differentiated in the TAN format: <source> and @src point to physical exemplars; <work> and @work to the conceptual.

The distinction may remind some readers of the domain model defined by the Functional Requirements for Bibliographic Records (FRBR), which identifies four types of entities for what it calls Group 1 (Products of intellectual & artistic endeavor): Work, Expression, Manifestation, and Item, the first pair being conceptual, non-material entities and the latter pair material ones.

TAN has been designed with a slightly different domain model in mind. FRBR Items are equivalent to what TAN calls scripta. Multiple scripta that for all intents and purposes are indistinguishable (i.e., items reproduced mechanically) are equivalent to FRBR Manifestations, but in TAN no corresponding entity has been defined. It is best to think of TAN scripta as being equivalent to FRBR Items, with FRBR Manifestations being sets of indistinguishable TAN scripta.

As for conceptual entities, TAN has been designed with the assumption that most users will find the distinction between Works and Expressions to be unhelpful or false. What one person calls a FRBR Expression another may legitimately call a Work (e.g., the King James Version is more than just a translation of the Bible). TAN assumes that any derivation of a Work (or Works) is itself a Work, which is really shorthand for work-version. Thus, in this manual the term version indicates merely a type of work that is known either to derive from another work or to be the basis for other versions of a work.

TAN avoids altogether the term Expression. Aside from the issues mentioned above, the term implies a medium (without which nothing can be expressed) and therefore materiality.

Every TAN-T(EI) file must be restricted to a transcription of a single version of a single conceptual work found on a single scriptum, segmented and labeled according to a single reference system.

This restrictive principle is critical to the success of the network. It reduces the risk of confusion, simplifies the files, and shifts markup complexity from an individual transcription file to the network in which that file participates.

Each TAN-T(EI) file transcribes one and only one text-bearing object or scriptum. It may be a digital file, a book, a manuscript, a stone, a sign, or a bottlecap. If the object you've chosen has been made mechanically and is virtually indistinguishable from other objects created in the same process (e.g., copies of a printed book or copies of a digital file), then the entire set of copies is to be treated as a single object (an entity some librarians call a manifestation).

The definition of some scripta requires an editor's discernment and judgment. For example, some manuscripts have been split up, their parts now residing in multiple libraries around the world; other manuscripts have been physically altered. In such cases, you may need to define your scriptum in a way that might not match the way others define it. But the decision is your prerogative, not theirs. You have both the right and responsibility to define your object in the way that you think will most benefit users of your files.

It is a good idea to name your scriptum in <source> with an <IRI> value in the form of an http URL provided by a library catalogue. This way you provide a way for others, perhaps through an algorithm, to retrieve extensive, structured bibliographical information. You also save yourself the hassle of writing a detailed bibliographical description that your users would have to tailor to suit their distinctive purposes. If a URL cannot be found for <IRI>, you may simply coin a tag URN or a UUID. Alternatively, if you find another TAN file that uses the same source, it would be a good idea to adopt that name.

The transcription must be restricted to a single creative work, identified by <work>.

Many scripta have more than one work. Identifying and defining the creative work you transcribe is, once again, your prerogative. Suppose the scriptum you have is a Bible. The work you choose from that object can take whatever contours you wish. Perhaps you wish to encode the entire Bible and treat it as a single work. Or maybe you wish to treat only the New Testament as the work, or the Tetraevangelion, or the Gospel of Matthew, or a specific episode in that gospel, or simply the Beatitudes. Any reasonable definition of a work is permitted, but a TAN-T(EI) file must contain nothing but the work you have defined. It should be a complete representation of what is found on the object (even if only partially preserved), and respect as far as is practical the order found in the scriptum.

Well-known works may have a suitable IRI name already assigned to them, say by means of a DBpedia entry. Most works, however, have not been assigned IRIs or are named in IRI vocabularies that are not well known. You may assign any work your own URN, through a UUID or a tag URN. Any IRIs that you mint are free to be used by other people writing TAN files about the same work. Similarly, if you find that another TAN-T file has transcribed a version of your work, you may also use that URN (you don't need to ask permission, since no URN can be copyrighted). As with other parts of the metadata, multiple <IRI>s and <name>s are names for the same work, not individual names for different works.
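As a minimal sketch of how such names might be minted, the following Python uses the standard uuid module for a UUID-based URN and assembles a tag URN by hand (the tag scheme is defined in RFC 4151 as tag:authority,date:specific). The function names, the authority example.com, the date, and the path work/hypothetical-commentary are all hypothetical; TAN itself prescribes none of them.

```python
import uuid
from datetime import date

def mint_uuid_urn():
    """Return a random UUID in URN form (urn:uuid:...)."""
    return uuid.uuid4().urn

def mint_tag_urn(authority, when, specific):
    """Build a tag name in the RFC 4151 pattern tag:<authority>,<date>:<specific>.

    `authority` should be a domain name or email address that you
    controlled on the date given in `when`.
    """
    return f"tag:{authority},{when.isoformat()}:{specific}"

# Hypothetical values, for illustration only.
print(mint_uuid_urn())
print(mint_tag_urn("example.com", date(2015, 1, 1), "work/hypothetical-commentary"))
```

A UUID needs no coordination with anyone, whereas a tag URN is easier for a human to read and signals who coined the name.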

The transcription must be restricted to a single version of the creative work, identified by <version> (optional). In most cases, <version> is unnecessary, because <work> in conjunction with <source> is sufficient to identify a particular work-version. But if the source carries multiple versions (e.g., a bilingual edition of a text), then <version> must be included.

If you wish to include other versions from a source, each one should have its own separate TAN-T(EI) file.

Notes should be included only if they are an integral part of the primary work (i.e., by the same author). Otherwise, you should ask yourself whether the notes are of any real interest. If they are not, ignore them. If they are important, put them in their own TAN-T(EI) file, or convert them to claims in a TAN-A-div file.

If you need to specify exactly where on a scriptum a version appears, <desc> or <comment> should be used.

Very few work-versions have their own URN names. It is advisable to assign a tag URN or a UUID. If the IRI you have used for <work> is in a namespace that you own or control, then you are entitled to modify it, and you may wish merely to add a suffix to the work IRI to name the version.

Every TAN transcription must be segmented into a hierarchy of uniquely labeled divisions, defined in the <body> through <div>s and their @type and @n values.
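To make the idea of uniquely labeled divisions concrete, the following Python sketch builds one reference per leaf <div> by joining the @n values along its path, then reports any reference that occurs more than once. The XML fragment is a simplified, hypothetical stand-in for a TAN-T <body> (no namespace, no <head>), and the function names and the period used as a separator are assumptions made for illustration, not part of TAN.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Simplified, hypothetical fragment; a real TAN-T file has a namespace and a <head>.
SAMPLE = """
<body>
  <div type="ch" n="1">
    <div type="v" n="1">In the beginning</div>
    <div type="v" n="2">was the word</div>
  </div>
  <div type="ch" n="2">
    <div type="v" n="1">Another chapter</div>
    <div type="v" n="1">A mislabeled verse</div>
  </div>
</body>
"""

def leaf_refs(body):
    """Return one reference per leaf <div>, built from the @n values on its path."""
    refs = []

    def walk(div, path):
        path = path + [div.get("n", "")]
        children = div.findall("div")
        if children:
            for child in children:
                walk(child, path)
        else:
            refs.append(".".join(path))

    for top in body.findall("div"):
        walk(top, [])
    return refs

def duplicate_refs(refs):
    """Return, in sorted order, every reference that occurs more than once."""
    counts = Counter(refs)
    return sorted(ref for ref, count in counts.items() if count > 1)

refs = leaf_refs(ET.fromstring(SAMPLE.strip()))
print(refs)                  # ['1.1', '1.2', '2.1', '2.1']
print(duplicate_refs(refs))  # ['2.1']
```

The second chapter in the sample deliberately repeats n="1", so the check flags the reference 2.1.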

Those divisions, whenever possible, should align with the reference system that prevails for the work across versions or translations, what is sometimes called a canonical reference system. Because even the most familiar reference system admits of degrees and dispute, the term canonical is problematic, so reference system is preferred in these guidelines.

If you have your choice, preference should be given to systems that follow the semantic contours of the work, not the physical features of a particular object. Chapter, paragraph, and sentence numbers are preferable to volume, page, and line numbers, because other derivative versions of a work (e.g., translations, paraphrases) will only roughly, if at all, follow an object-oriented reference system.

Sometimes an object-based reference system is inescapable, or is the most common reference system for a work (e.g., Porphyry's commentary on the Categories). It is perfectly acceptable to adopt that scheme, but it may eventually entail more labor for the alignment process.

If a given work has multiple systems (e.g., the works of Plato and Aristotle, which have two reference systems—semantic- and object-oriented—both of which are standard and important), then the recommended practice is to encode the same text twice, placing in each file a <see-also> pointing to the other and a <relationship> with the keyword alternatively divided edition as the value of @which. A pair of alternatively divided editions can usefully serve as the basis for concordances. In fact, the pair can be used as the first step in converting another version of the same work from one reference system to the other.
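One way to picture how a pair of alternatively divided editions yields a concordance is to compare character offsets. The Python sketch below takes two segmentations of the same text, each a list of (reference, text) pairs, and lists for every reference in the first the references in the second that overlap it. The function names, the sample labels, and the sample text are hypothetical, and the sketch assumes that the two editions carry identical text once white space is normalized and that no word is broken across divisions.

```python
import re

def normalize_space(text):
    """Collapse runs of white space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

def spans(edition):
    """Give each reference its (start, end) character offsets in the joined text."""
    out, pos = [], 0
    for ref, text in edition:
        text = normalize_space(text)
        out.append((ref, pos, pos + len(text)))
        pos += len(text) + 1  # one space separates adjacent divisions
    return out

def concordance(edition_a, edition_b):
    """For each reference in A, list the references in B whose text overlaps it."""
    joined_a = " ".join(normalize_space(t) for _, t in edition_a)
    joined_b = " ".join(normalize_space(t) for _, t in edition_b)
    if joined_a != joined_b:
        raise ValueError("the two editions do not carry the same text")
    spans_b = spans(edition_b)
    return {
        ref_a: [ref_b for ref_b, s_b, e_b in spans_b if s_b < e_a and s_a < e_b]
        for ref_a, s_a, e_a in spans(edition_a)
    }

# Hypothetical data: one edition divided by section, the other by page.
by_section = [("1", "alpha beta gamma"), ("2", "delta epsilon")]
by_page = [("17a", "alpha beta"), ("17b", "gamma delta"), ("17c", "epsilon")]
print(concordance(by_section, by_page))
# {'1': ['17a', '17b'], '2': ['17b', '17c']}
```

A third version divided only by section could then be given approximate page references by looking each section up in such a table.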

If there is a good reference system, but the divisions are overly lengthy, you may introduce subdivisions. Such subdivided texts are compatible with references to the older system. But there is no guarantee that the provisional subdivisions you introduce will be adopted by other editors who create or edit TAN versions of the same work, and in the end editors working independently upon the same text may produce discordant schemes. The TAN-A-div format was designed to reconcile such differences.

If there is no reference system, or if you think that the ones that exist are inadequate or misguided, create one of your own. If you develop your own reference system, be sure to optimize for all versions of the work, whether known or not.

In the <declarations>, at least one <div-type> must be supplied, declaring the types of divisions into which the text has been segmented, to be referred to by @type in <div>s. To declare a <div-type> does not require you to use it in the transcription. It is advisable to keep the abbreviation coined in @xml:id brief but meaningful.
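The relationship between declared division types and the @type values of <div>s can be checked mechanically. The Python sketch below reads a simplified, hypothetical fragment (the wrapper element <doc>, the absence of a namespace, and the function name are assumptions for illustration) and reports division types that are used but not declared, along with those declared but not used. Only the first group signals a problem, since declaring a type does not oblige you to use it.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, hypothetical fragment; not a complete or valid TAN-T file.
SAMPLE = """
<doc>
  <declarations>
    <div-type xml:id="ch"/>
    <div-type xml:id="v"/>
    <div-type xml:id="title"/>
  </declarations>
  <body>
    <div type="ch" n="1">
      <div type="v" n="1">First verse</div>
      <div type="para" n="2">Second unit</div>
    </div>
  </body>
</doc>
"""

def check_div_types(root):
    """Compare declared <div-type> ids with the @type values used on <div>s."""
    declared = {dt.get(XML_ID) for dt in root.iter("div-type")}
    used = {div.get("type") for div in root.iter("div")}
    return {
        "undeclared": sorted(used - declared),  # an error to fix
        "unused": sorted(declared - used),      # permitted
    }

print(check_div_types(ET.fromstring(SAMPLE.strip())))
# {'undeclared': ['para'], 'unused': ['title']}
```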

Well-known division types already have suitable IRI names. See the section called “TAN keywords for types of divisions (<div-type>)” for a list of core TAN vocabulary for division types, both common and uncommon. If you encounter a rare division type, or one that needs specificity not provided for in a well-known URN, you should mint your own, either in the declarations or in a separate TAN-key file.

Reference systems have as a central component numbering systems. TAN supports five numeration systems:

  1. Arabic numerals. 1, 2, 3, etc.

  2. Roman numerals. Values up to 5000, utilizing i, v, x, l, c, d, and m, uppercase or lowercase, with liberal syntactic rules (within a roman numeral, any digit preceding one of a higher value is assumed to be a subtraction from the total value; all others are positive values).

  3. Alphabetic sequences. The 26-letter Roman alphabet, with numbers higher than 26 (or any multiple of 26) beginning with the letter a incrementally repeated, e.g., y (25), z (26), aa (27), bb (28), … aaa (53). Uppercase or lowercase allowed.

  4. Arabic numerals + alphabetic sequences. Arabic numerals followed immediately by an alphabetic sequence. The second item is to be calculated as a subsequence of the first item, with the lack of a second item taking highest priority. E.g., 4, 4a, 4b, 4c....

  5. Alphabetic sequences + Arabic numerals. As above, but with alphabetic sequence preceding Arabic numerals.
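The rules for systems 2 through 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the conversion that TAN processors perform: the function names are hypothetical, the ceiling of 5000 on Roman numerals is not enforced, and "preceding one of a higher value" is read here as "immediately preceding".

```python
import re

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

def roman_to_int(label):
    """Liberal reading: a digit directly before a larger digit is subtracted."""
    values = [ROMAN[ch] for ch in label.lower()]
    total = 0
    for i, value in enumerate(values):
        larger_follows = i + 1 < len(values) and values[i + 1] > value
        total += -value if larger_follows else value
    return total

def alpha_to_int(label):
    """One letter repeated k times: a=1 ... z=26, aa=27 ... zz=52, aaa=53."""
    label = label.lower()
    if not re.fullmatch(r"([a-z])\1*", label):
        raise ValueError(f"not an alphabetic sequence: {label!r}")
    return (len(label) - 1) * 26 + (ord(label[0]) - ord("a") + 1)

def numeral_letter_key(label):
    """Sort key for labels such as 4, 4a, 4b: a bare numeral sorts first."""
    match = re.fullmatch(r"([0-9]+)([A-Za-z]*)", label)
    if not match:
        raise ValueError(f"not a numeral + letter label: {label!r}")
    number, letters = match.groups()
    return (int(number), alpha_to_int(letters) if letters else 0)

print(roman_to_int("xiv"), roman_to_int("MCMXC"))       # 14 1990
print([alpha_to_int(s) for s in ("y", "z", "aa", "bb", "aaa")])
# [25, 26, 27, 28, 53]
print(sorted(["4b", "4", "4a", "3"], key=numeral_letter_key))
# ['3', '4', '4a', '4b']
```

System 5 would mirror numeral_letter_key with the two parts of the pattern swapped.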

TAN file processors will attempt to convert all values of @n to Arabic numerals. Some values are ambiguously Roman numerals or alphabetic sequences, e.g., c (= 3 or 100), so this conversion takes place within the context of a single document, without reference to any associated files. You may not mix Roman numerals and alphabetic sequences in the same div type. You should also avoid any string labels that would be misinterpreted as a Roman numeral. For example, if you are labeling a book whose title is "Civilizations," you should not use n="Civ", since all values of @n are treated as lowercase and civ would be read as the Roman numeral 104.
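A rough way to spot trouble before validation is to classify each label by the numeration systems it could belong to. The Python sketch below is a hypothetical helper, not TAN's own logic: it assumes that any string made up solely of the letters i, v, x, l, c, d, and m counts as a possible Roman numeral, and that an alphabetic sequence is a single letter repeated one or more times.

```python
import re

ROMAN_LETTERS = set("ivxlcdm")

def looks_roman(label):
    """True if the lowercased label consists solely of Roman-numeral letters."""
    label = label.lower()
    return bool(label) and set(label) <= ROMAN_LETTERS

def looks_alphabetic(label):
    """True if the lowercased label is one Latin letter, possibly repeated."""
    return re.fullmatch(r"([a-z])\1*", label.lower()) is not None

def classify(label):
    """Return 'ambiguous', 'roman', 'alphabetic', or 'other' for one label."""
    roman, alpha = looks_roman(label), looks_alphabetic(label)
    if roman and alpha:
        return "ambiguous"
    if roman:
        return "roman"
    if alpha:
        return "alphabetic"
    return "other"

def mixes_systems(labels):
    """True if a set of labels needs both Roman and alphabetic readings."""
    kinds = {classify(label) for label in labels}
    return "roman" in kinds and "alphabetic" in kinds

print(classify("c"), classify("Civ"), classify("aa"), classify("4a"))
# ambiguous roman alphabetic other
print(mixes_systems(["a", "b", "c", "iv"]))    # True
print(mixes_systems(["i", "ii", "iii", "iv"])) # False
```

Labels classed as ambiguous can be resolved only by looking at their neighbors in the same div type, which is why the conversion is scoped to a single document.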

There are also tools for other numeration systems, but they have not been implemented in the validation process. See tan:arabic-numerals(), tan:grc-to-int(), and tan:syc-to-int().

You should declare how you have normalized the transcription via <filter> and its children, <normalization>, <transliteration>, and <replace>. (For suggestions on values of <IRI> for <normalization> see the section called “TAN keywords for types of normalizations (<normalization>)”.)

Generally speaking, normalization entails the suppression of things extraneous to or separable from the work you have chosen. You are encouraged to omit parenthetical editorial insertions, stray handwritten remarks, discretionary word-breaking hyphens, editorial comments, inserted cross-references, and reference numerals (page numbers, section numbers, etc.). The goal is a transcription whose text is free of the interpretive voice of later editors. In addition, you should resolve ligatures and correct unintended typographical errors. (Such orthographic corrections are useful to those users who want to generate lexico-morphological data automatically or semiautomatically.)
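Two of these steps lend themselves to partial automation. The Python sketch below resolves the common Latin ligatures of the Unicode Alphabetic Presentation Forms block and lists words that are hyphenated at a line end, so that an editor can decide which hyphens are discretionary. The function names and the sample strings are hypothetical, the ligature table is deliberately short, and the hyphen check only proposes candidates, because a script cannot tell a word-breaking hyphen from a genuine one.

```python
import re

# Latin ligatures from the Alphabetic Presentation Forms block (U+FB00..U+FB06).
LIGATURES = str.maketrans({
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
    "\ufb03": "ffi", "\ufb04": "ffl", "\ufb06": "st",
})

def resolve_ligatures(text):
    """Replace each ligature character with its component letters."""
    return text.translate(LIGATURES)

def line_end_hyphen_candidates(text):
    """List (first half, second half) pairs for words hyphenated at a line end."""
    return re.findall(r"(\w+)-[ \t]*\n[ \t]*(\w+)", text)

print(resolve_ligatures("e\ufb03cient \ufb01sh"))
# efficient fish
print(line_end_hyphen_candidates("a tran-\nscription and a well-\nknown text"))
# [('tran', 'scription'), ('well', 'known')]
```

In the sample, the first candidate is a discretionary break to be closed up, while the second is a real hyphen that should be kept, which is why the final decision is left to the editor.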

In a digital source, variable lengths of spacing marks (e.g., General Punctuation U+2000..U+200B) should be converted to ordinary spaces, and superscript combining Roman letters (U+0363..U+036F) should probably be converted to their non-combining counterparts. All Unicode must be normalized to NFC forms (see the section called “Normalization”).
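These conversions can be sketched with Python's standard unicodedata module. The function name and the flatten_superscripts flag are hypothetical; the flag is off by default because the replacement of combining letters is only a tentative recommendation. The mapping for U+0363..U+036F is derived from the official character names, each of which ends in the base letter.

```python
import re
import unicodedata

# Spacing characters in General Punctuation, U+2000..U+200B.
SPACING = re.compile("[\u2000-\u200b]")

# U+0363..U+036F are named COMBINING LATIN SMALL LETTER A, E, I, O, U, C, ...
COMBINING_LETTERS = {
    cp: unicodedata.name(chr(cp)).rsplit(" ", 1)[-1].lower()
    for cp in range(0x0363, 0x0370)
}

def normalize_unicode(text, flatten_superscripts=False):
    """Convert special spaces to U+0020, optionally flatten combining letters, apply NFC."""
    text = SPACING.sub(" ", text)
    if flatten_superscripts:
        text = text.translate(COMBINING_LETTERS)
    return unicodedata.normalize("NFC", text)

print(normalize_unicode("a\u2003b") == "a b")                # True
print(normalize_unicode("e\u0301") == "\u00e9")              # True
print(normalize_unicode("u\u0364", flatten_superscripts=True))  # ue
```

Each special space becomes exactly one ordinary space here; collapsing runs of spaces is a separate step.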

Keep in mind that your transcriptions will be used by other people doing, e.g., word-for-word translation alignments, quotation checking, or syntactical analysis, and they will want transcriptions that are as clean as possible. You should remove from the text anything that is not part of the work proper and would interfere with detailed word-for-word alignment, or would require extra preprocessing or postprocessing work for later users. If you are segmenting a source at line breaks, and you are required to break a word between divisions, you should use either the soft hyphen (&#xad;) or the zero-width joiner (&#x200d;) at the end of the first <div>. TAN processors that handle a <div> will automatically normalize the space in the element, then place a space between that <div> and the next, unless one of those two characters is present, in which case the character will be deleted and the two <div>s will be joined with no intervening space. For more on issues regarding whitespace, see the section called “White space”.
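The joining behavior just described can be approximated in a few lines of Python. This is a sketch of the rule as stated here, not the code of any TAN processor; the function names and the sample text are hypothetical, and space normalization is assumed to follow the XML convention (space, tab, carriage return, line feed).

```python
import re

SOFT_HYPHEN = "\u00ad"
ZERO_WIDTH_JOINER = "\u200d"

def normalize_space(text):
    """Collapse runs of XML white space and trim both ends."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" \t\r\n")

def join_divs(div_texts):
    """Join the text of consecutive leaf divs into one string."""
    pieces = []
    for text in div_texts:
        text = normalize_space(text)
        if text.endswith((SOFT_HYPHEN, ZERO_WIDTH_JOINER)):
            pieces.append(text[:-1])   # drop the joining character, add no space
        else:
            pieces.append(text + " ")  # ordinary case: one space between divs
    return "".join(pieces).rstrip(" ")

lines = ["  arma virumque ca\u00ad\n", "no Troiae", "qui primus"]
print(join_divs(lines))
# arma virumque cano Troiae qui primus
```

The word cano, split across the first two divisions in the sample, is restored because the first division ends in a soft hyphen.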

If you are working with a text with notes, distinguish those written by the same person who wrote the work you're transcribing from those that are not. Treat the former as part of the work proper and give each note a <div> with a suitable @type and place it after the <div> it annotates. It will be assumed by processors of the data that, absent more specific information, any <div> of an annotating @type is an annotation of the last <div> that is not an annotation. (Alternatively, you may use the <note> feature of TAN-TEI, but bear in mind that this element will be treated by users as part of the leaf div to which it belongs, not separate from it.)
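The default assumption about annotating divisions can be sketched in Python as follows. The representation of each <div> as a (type, n, text) tuple, the function name, and the division type labels p and note are hypothetical choices made for illustration.

```python
def attach_annotations(divs, annotating_types):
    """Pair each annotating div with the most recent non-annotating div."""
    pairs = []
    last_main = None
    for div_type, n, _text in divs:
        if div_type in annotating_types:
            pairs.append(((div_type, n), last_main))
        else:
            last_main = (div_type, n)
    return pairs

# Hypothetical sequence of sibling divs, in document order.
divs = [
    ("p", "1", "First paragraph."),
    ("note", "1", "Author's note on paragraph 1."),
    ("note", "2", "A second note on paragraph 1."),
    ("p", "2", "Second paragraph."),
    ("note", "3", "Author's note on paragraph 2."),
]
for note, target in attach_annotations(divs, {"note"}):
    print(note, "annotates", target)
# ('note', '1') annotates ('p', '1')
# ('note', '2') annotates ('p', '1')
# ('note', '3') annotates ('p', '2')
```

Two consecutive notes both attach to the same preceding paragraph, because a note is never treated as the target of another note.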

If the notes are not part of the work per se—for example, translator's notes in a translation of a primary source—you should treat them as a separate work altogether, and put them in a separate TAN-T(EI) file, perhaps linking the two through <see-also>. You may wish to structure that file so that it mirrors the reference system of the primary source, in which case further alignment between the two is not needed. Or you may wish to use a reference system that reflects how you would cite the note, e.g., page and note number. In this latter case, you would then create a companion TAN-A-div file that establishes links between the primary source and its annotations.

Remember that the note signals in the main text and in the footnote area are metadata meant to help readers link corresponding passages of texts, and should be deleted. If the connective function served by the note signal is important, use a TAN-A-div file to link the notes to the main text.

This principle holds true for transcribing texts that have variants to the work integrated into the document. For example, a manuscript may have correctors' marks. Or a set of footnotes (or apparatus criticus) might comment on how and why the main text differs from previous readings. In those cases, each set of corrections might be wholly incorporated into the <claim>s of a TAN-A-div file, perhaps also with a separate TAN-T file.

Overall, normalization is a difficult topic, and it is not well studied. Not all decisions will be clear-cut. You may justly hesitate before normalizing orthography, punctuation, accentuation, or capitalization. Some aspects of Unicode that lend themselves to varying conventions may need special consideration. You may need to consider whether an unusual or rarely used Unicode character might be misinterpreted, or a hindrance to other users (especially for parsing word tokens). In the <filter>, describe any decisions that might not be agreeable to everyone who uses the file.

In some ambiguous areas, you can use TAN-TEI to your advantage. Suppose, for example, a manuscript has reference numerals that are sui generis. That is, these reference numerals do not correspond to the "canonical" reference scheme. On the one hand, they are metadata, and should arguably be deleted; on the other, they are part of the text, and witness to how a text was read and changed over time. A middle-ground approach would move these references to TAN-TEI's <milestone rend="">. In that way, the numerals are removed from the main text, yet the information is retained. Generally speaking, TEI's @rend is an excellent way to remove something from the main text without removing it from the file altogether.