Transcriptions Using the Text Encoding Initiative (<TEI>)


	Note
	This section is to be read in conjunction with Chapter 5, Class-1 TAN Files, Representations of Textual Objects (Scripta) and the section called “The Text Encoding Initiative”, which address related technical issues.

Some creators and editors of transcriptions will find the rather stripped-down TAN-T format inadequate. Some may wish to mark up the text further. Some may already have a library of transcriptions whose annotations are worth keeping, even if they do not interest every user. In these cases, you should use TAN-TEI, a customization of the Text Encoding Initiative (TEI) format, which is well known for its expressiveness, its stability, its flexibility, and its widespread use in textual scholarship.

TEI was designed to be maximally expressive and flexible, to serve the detailed needs of scholars in the humanities. In serving this mission, TEI has come to define more than five hundred different elements, and more than two hundred attributes (roughly six times more than are defined in TAN). Of course, any given TEI file uses only a small subset of those elements and attributes, and TEI itself comes in different flavors, from TEI Lite, which uses only 75 attributes and 140 elements, to TEI All, which opens up almost the entire library.

Although TEI XML is often described as a standard, it lacks characteristics one normally expects of a standard. It is very flexible, admits flavors and interpretation, and is best used when it is customized. Individuals and projects may define their own subset of TEI elements, to constrict or expand the allowable rules as they see fit. TAN-TEI is one of those customizations, based on TEI All. The major difference between TEI All and TAN-TEI is that the latter imposes extra strictures, to ensure that transcriptions are maximally likely to be interchangeable with other TAN-TEI files.

All TEI files are validated against a TEI-conformant schema, normally expressed as an XML DTD, RELAX NG, or W3C Schema. TAN's TEI-conformant schema is based upon the TAN-TEI.odd file in the schemas directory, converted to RELAX NG files (TEI.rnc and TEI.rng) that define the structural rules of TAN-TEI files. There is an additional layer of validation, through the related Schematron process (TEI.sch), which performs detailed validation not possible in a TEI-conformant schema. In the discussion below, it is important to distinguish between structural validation and Schematron validation. See the section called “The TAN Validation Process”.
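
As a rough illustration of the two validation layers, a TAN-TEI file might begin with a pair of `xml-model` processing instructions, one for structural validation and one for Schematron validation. This is a sketch rather than a prescribed prolog: the file names are those mentioned above, and the relative path `../schemas/` is an assumption that depends on where your file sits relative to the schemas directory.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Structural validation (RELAX NG, compact syntax); the path is hypothetical -->
<?xml-model href="../schemas/TEI.rnc" type="application/relax-ng-compact-syntax"?>
<!-- Schematron validation; the path is hypothetical -->
<?xml-model href="../schemas/TEI.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
   <!-- Required root attributes and all content are omitted here for brevity -->
</TEI>
```

An editor that honors `xml-model` will run both checks, so an error reported by the first is a structural problem and an error reported by the second comes from the Schematron layer.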

TAN's customization of the TEI can be summarized as follows (the default namespace in this section is the TEI namespace, http://www.tei-c.org/ns/1.0):

Table 5.3. Synopsis of TAN-TEI customization

TEI element	Strictures
`<TEI>`	Must have `@id` with a tag URN. Must have `@TAN-version`. Takes a new child element, `<head>`, placed between `<teiHeader>` and `<text>`; it and its descendants must be in the TAN namespace, `xmlns:tan="tag:textalign.net,2015:ns"`.
`<text>`	There are no extra strictures, but during Schematron validation (not RELAX-NG), this element and any children `<front>` and `<back>` will be ignored. Of its children, only `<body>` will be Schematron validated.
`<body>`	Must take `@xml:lang`. Any non-`<div>` children will be ignored during Schematron validation; most often only `<div>`s should be children. Contents must be restricted to a single version of a single work. Any and all text nodes will be treated as part of the transcription.
`<div>`	May encompass a textual division of whatever size you like (TEI defines `<div>` as being larger than block-like or paragraph-like textual divisions; TAN's `<div>` is much more like HTML's). Must take elements: either they are all `<div>`s (perhaps interleaved with anchors such as `<pb>`) or none of them are `<div>`s (non-mixed model). Must take `@type` and `@n` (or only `@include`). `@type` may take multiple space-delimited values, each pointing via IDref to a vocabulary item. `@n` must consist of word characters, periods, or underscores, conforming to the following regular expression: `[\w\._]+([\- ,]+[\w\._]+)*`. If `@n` is to be given more than one value, those items must be separated by a space or a comma. A hyphen-minus, - (U+002D, the most common form of hyphen), always has special meaning in `@n`, specifying a range. This feature is useful for cases where a `<div>` straddles more than one standard reference number (e.g., a translation of Aristotle that cannot be easily tied to Bekker numbers). If you need to use a hyphen-like character in an `@n` that does not specify a range of numbers, consider ‐ (U+2010 HYPHEN), ‑ (U+2011 NON-BREAKING HYPHEN), ‒ (U+2012 FIGURE DASH), – (U+2013 EN DASH), or − (U+2212 MINUS SIGN).
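
The strictures on `<div>` may be easier to see in a small fragment. The following is a hypothetical sketch, not taken from a real transcription: the `@type` values `chapter` and `section` are assumed to be IDs of vocabulary items declared in the TAN `<head>`, and the text itself is invented.

```xml
<body xmlns="http://www.tei-c.org/ns/1.0" xml:lang="eng">
   <!-- Non-leaf div: every child element is a div, apart from anchors such as pb -->
   <div type="chapter" n="1">
      <pb n="12"/>
      <!-- Leaf divs: no div children; other TEI markup may appear inside -->
      <div type="section" n="1">
         <p>Text of the first section.</p>
      </div>
      <!-- The hyphen-minus marks a range: this div covers sections 2 through 3 -->
      <div type="section" n="2-3">
         <p>Text that straddles the second and third sections.</p>
      </div>
   </div>
</body>
```

The values `2-3`, `2,3`, and `2 3` all match the regular expression in the table; the first is read as a range, the other two as a pair of separate values.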

TAN-TEI files have two heads, which may strike you as strange. Each head does something different, and was designed for different purposes. Whereas the TAN <head> is meant to be brief and restricted to only those matters relevant to the transcription, the <teiHeader> permits quite an expansive range of metadata, and may be used to encode a variety of things, including those that are tangential or irrelevant to the data. Unlike the TAN <head>, whose data is designed to be both computer- and human-readable, <teiHeader> was designed for data to be read principally by humans; although it can accommodate IRIs, it was not designed around them. Further, a TAN <head> can never be empty and valid; a bare-bones <teiHeader> with no actual text content, such as the following, is considered valid:

<teiHeader>
   <fileDesc>
      <titleStmt><title/></titleStmt>
      <publicationStmt><p/></publicationStmt>
      <sourceDesc><p/></sourceDesc>
   </fileDesc>
</teiHeader>

TAN's Schematron validation process ignores the contents of <teiHeader>, since its contents are unpredictable and therefore not reliably parsable. If your <teiHeader> has any kind of metadata that needs to appear in the TAN <head> (see the section called “Metadata (<head>)” and the section called “Principles and Assumptions”), the conversion needs to be performed manually, since (as mentioned above) the two headers are incommensurate, and writing each one requires a different mentality.

In a TAN-TEI file, the TAN <head> must be in the TAN namespace, i.e., <head xmlns="tag:textalign.net,2015:ns"> (or <tan:head xmlns:tan="tag:textalign.net,2015:ns">, but this would require all descendant elements to be prefixed tan:).
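
A minimal sketch of how the two namespaces sit together in one file, using the default-namespace form of the TAN `<head>`. The `@id` value, the `@TAN-version` value, and the `@type` value are hypothetical placeholders, and the contents of the TAN `<head>` are deliberately left out.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     id="tag:example.com,2015:sample-transcription"
     TAN-version="2021">
   <teiHeader>
      <fileDesc>
         <titleStmt><title/></titleStmt>
         <publicationStmt><p/></publicationStmt>
         <sourceDesc><p/></sourceDesc>
      </fileDesc>
   </teiHeader>
   <!-- Redeclaring the default namespace puts head and its descendants in the TAN namespace -->
   <head xmlns="tag:textalign.net,2015:ns">
      <!-- TAN metadata elements go here, unprefixed; an empty head will not validate -->
   </head>
   <!-- From here on, elements are back in the TEI namespace -->
   <text>
      <body xml:lang="eng">
         <div type="section" n="1">
            <p>Sample text.</p>
         </div>
      </body>
   </text>
</TEI>
```

Because the namespace redeclaration is scoped to `<head>`, neither its descendants nor the TEI elements that follow need a prefix.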

Within any leaf <div>, you may use whatever TEI markup you wish, to whatever level of depth or complexity. Most users of your TAN-TEI file will be interested in the text; only a subset will care about any markup within leaf <div>s.

TEI files are flexible, permitting different approaches to markup. A TAN-TEI file should not be scriptum-oriented, i.e., it should not try to replicate how the text appears or looks on the object.

You may have a TEI file that you wish to convert to TAN-TEI. As a matter of practicality, it is helpful to envision the conversion process as falling into three steps:

1. Structure: insert new processing instructions (pointing to files to perform TAN-TEI structural and Schematron validation); adjust the root element by supplying @id (a tag URN) and @TAN-version.
2. Metadata: create a new <head xmlns="tag:textalign.net,2015:ns"> and populate it.
3. Data: edit <body> to make sure all text nodes are restricted to the content of a single version of a single work; restructure <body> content into nesting <div>s with correct @type and @n values.

Those who have made TEI-to-TAN-TEI conversions have found that step 2 is the most time-consuming, particularly in finding suitable IRIs. But step 3 should not be underestimated, either. Many people write TEI files with a focus on the original textual object, and they do not normalize to the level expected in a TAN file. Some TEI files have been written with little attention paid to space and space normalization. Some TEI files are so laden with annotations that the text is impossible to read. In general, the simpler the TEI file, the better, with annotations pushed to external files.

Some TEI markup is already implicit, or is easily calculable (e.g., <w> to mark words, which should already comport with the tokenization declared in the <head>; users of <w> easily lose track of where space is and isn't). Some TEI markup can be expressed in a class-2 file (e.g., lexico-morphological data, which should be expressed in a TAN-A-lm file).
