Transcriptions Using the Text Encoding Initiative (<TEI>)


This section is to be read in conjunction with Chapter 5, Class-1 TAN Files, Representations of Textual Objects (Scripta) and the section called “The Text Encoding Initiative”, which address technical issues relating to TAN-compliant TEI, to XML, and to validation generally.

Some creators and editors of transcriptions will find the rather stripped-down TAN-T format inadequate. Some may wish to mark up the text further, or may already have a library of transcriptions whose annotations are worth keeping, even if some users may not be interested in them. To serve these needs, you should use TAN-TEI, an extension to the Text Encoding Initiative (TEI) format, which is well known for its expressiveness, its stability, its flexibility, and its widespread use in scholarship.

TEI was designed to be maximally expressive and flexible, to serve the detailed needs of humanities scholars. In serving this mission, TEI has come to define more than five hundred different element names, and more than two hundred attributes (roughly six times more than are defined in TAN). Of course, any given TEI file uses only a small subset of those elements and attributes, and TEI itself comes in different flavors, from TEI Lite, which uses only 75 attributes and 140 elements, to TEI All, which opens up almost the entire library.

Although the TEI format is often seen as a standard, it lacks some of the characteristics expected in a standard. It is greatly flexible, admits flavors and interpretation, and has been designed to encourage customization. Individuals and projects may define their own subset of TEI elements, to constrict or expand the allowable rules as they see fit. TAN-TEI is one of those customizations. The major difference is that TAN-TEI attempts to impose extra strictures not defined in TEI, to ensure that transcriptions are maximally likely to be interchangeable with other TAN files.

TAN's customization of the TEI can be summarized as follows (the default namespace in this section is the TEI namespace):

Table 5.1. Synopsis of TAN-TEI customization

TEI element: summary of alteration

<TEI>

  • must have @id with IRI name

  • should take new namespace declaration, xmlns:tan=",2015:ns"

  • takes a new child element, <head>, placed between <teiHeader> and <text>

<text>

  • Only the child <body> will be regarded by other TAN users. <front> and <back> will be ignored.

<body>

  • must take @xml:lang

  • may take @in-progress

  • must take exclusively one or more <div>s

  • any elements or text between <div>s will be ignored

  • overall contents must be restricted to a single work

<div>

  • any and all text nodes will be treated as part of the transcription

  • must take either only <div>s or no <div>s at all

  • must take @type and @n (@include is not allowed in TAN-TEI, but is allowed in TAN-T)
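Several of the structural rules in the table above can be checked mechanically. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree; it is not the official TAN validation, which relies on the TAN-TEI validation files. The function name check_tan_tei, the sample document, and the sample IRI name are hypothetical and serve only as illustration. The sketch reports non-<div> content in <body> as a finding, even though the table notes that such content is merely ignored.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tei(tag):
    """Return the ElementTree-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def check_tan_tei(root):
    """Return a list of findings; an empty list means none of these checks failed."""
    if root.tag != tei("TEI"):
        return [f"root element is {root.tag}, not TEI"]
    problems = []
    if not root.get("id"):
        problems.append("<TEI> lacks @id")
    body = root.find(f"{tei('text')}/{tei('body')}")
    if body is None:
        return problems + ["no <text>/<body> found"]
    if not body.get(XML_LANG):
        problems.append("<body> lacks @xml:lang")
    children = list(body)
    if not children or any(c.tag != tei("div") for c in children):
        problems.append("<body> should contain one or more <div>s and nothing else")
    for div in body.iter(tei("div")):
        label = f"<div n={div.get('n')!r}>"
        if div.get("type") is None or div.get("n") is None:
            problems.append(f"{label} lacks @type or @n")
        kids = list(div)
        div_kids = [k for k in kids if k.tag == tei("div")]
        # A <div> takes either only <div>s or no <div>s at all.
        if div_kids and len(div_kids) != len(kids):
            problems.append(f"{label} mixes <div> children with other elements")
    return problems


# Hypothetical, heavily abbreviated sample; the IRI name is invented.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0" id="tag:example.com,2015:demo">
  <teiHeader/>
  <text>
    <body xml:lang="eng">
      <div type="chapter" n="1">
        <div type="section" n="1">First <hi>leaf</hi> text.</div>
        <div type="section" n="2">Second leaf text.</div>
      </div>
    </body>
  </text>
</TEI>"""

print(check_tan_tei(ET.fromstring(SAMPLE)))  # prints []
```

The sketch checks only what can be read off the element tree. Rules such as restricting the contents to a single work call for editorial judgment and are out of its reach.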

As in all other TAN files, the root element of a TAN-TEI file must take an @id, the IRI name. See above, the section called “Tag URNs”.

TAN-TEI files have two heads, which may strike you as odd. The TEI head and the TAN head were designed for different purposes. Whereas the TAN <head> is meant to be brief and keyed to both IRIs and human-readable data, the <teiHeader> has been designed principally for human readability. It permits quite an expansive range of metadata, including matters that bear on the transcription only indirectly (e.g., manuscript descriptions).

Processors of TAN-TEI files will in general ignore the contents of <teiHeader>, since the contents are unpredictable. If your <teiHeader> has any kind of metadata relevant to TAN users, you will need to adapt it for the standard TAN <head> (see the section called “Metadata (<head>)” and the section called “Principles and Assumptions”). You may find that some of the material you put in <teiHeader> is not suitable for <head> and vice versa. This conversion needs to be performed manually, since the two headers are incommensurate, and writing each one requires a different kind of outlook.

In a TAN-TEI file, the TAN <head> must declare the TAN namespace to be its default, i.e., <head xmlns=",2015:ns"> or <tan:head> if the prefix tan: has been defined in the root element.

Within any leaf <div>, you may use whatever TEI markup you wish, to whatever level of depth or complexity. All users of your TAN-TEI file will be interested in the text; only a subset will care about any markup within leaf <div>s. For this reason, even if you change the value of @xml:lang within a leaf <div>, there is no guarantee that readers or processors of your data will take it into account.
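To illustrate what it means for a processor to care only about the text of a leaf <div>, here is a minimal Python sketch that flattens each leaf <div> to plain text and discards the markup inside it. The function name leaf_texts, the dot-joined reference built from @n values, and the whitespace collapsing are assumptions made for this illustration; they are not behavior prescribed by TAN. Note that the inner @xml:lang on <foreign> is simply lost, as the paragraph above warns.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
DIV = f"{{{TEI_NS}}}div"


def leaf_texts(parent, path=()):
    """Yield (reference, text) for every leaf <div> below parent, in document order."""
    for div in parent.findall(DIV):
        ref = path + (div.get("n", "?"),)
        if div.find(DIV) is None:
            # Leaf: join all descendant text nodes, then collapse whitespace.
            text = " ".join("".join(div.itertext()).split())
            yield ".".join(ref), text
        else:
            # Not a leaf: descend, extending the reference with this @n.
            yield from leaf_texts(div, ref)


# Hypothetical sample body, for illustration only.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0" xml:lang="eng">
  <div type="chapter" n="1">
    <div type="section" n="1">In the <hi rend="italic">beginning</hi> was
      the word.</div>
    <div type="section" n="2">A second <foreign xml:lang="lat">verbum</foreign>.</div>
  </div>
</body>"""

for ref, text in leaf_texts(ET.fromstring(SAMPLE)):
    print(ref, text)
```

Because the two leaf <div>s reduce to bare strings keyed by their @n values, they can be compared directly with the same passages in another version of the work, whatever markup either version carries.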

TAN-TEI should not be used to try to represent the physical appearance of the text on the object. Write a separate TEI (non-TAN) file first, and then use TAN-TEI to create a more normalized version.

You may need to prepare a TEI file to be TAN compliant. As a matter of practicality, it is helpful to envision the conversion process as falling into three steps:

  1. Structure: insert new processing instructions (TAN-TEI validation files); adjust root element by supplying IRI name to @id, TAN namespace to @xmlns:tan.

  2. Metadata: create new <head> and populate it

  3. Data: edit <body> to restrict the content to a single work; restructure <body> content into nesting <div>s with correct @type and @n values.
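The purely mechanical parts of steps 1 and 2 can be sketched in Python as follows. This is only a starting point under stated assumptions: the function name start_conversion is hypothetical, the processing instructions for the validation files are not handled, and the TAN namespace is passed in by the caller. The placeholder namespace and the IRI name in the usage lines are invented and are not the real values. Populating <head> and carrying out step 3 remain manual work.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def start_conversion(root, iri_name, tan_ns):
    """Set @id on a parsed <TEI> root and add an empty TAN <head>, in place."""
    root.set("id", iri_name)  # step 1: IRI name on the root element
    if root.find(f"{{{tan_ns}}}head") is None:
        # Step 2 (beginning only): an empty <head> in the TAN namespace.
        head = ET.Element(f"{{{tan_ns}}}head")
        children = list(root)
        text = root.find(f"{{{TEI_NS}}}text")
        # Place <head> immediately before <text>, i.e. after <teiHeader>.
        position = children.index(text) if text is not None else len(children)
        root.insert(position, head)
    return root


# Placeholder only: substitute the actual TAN namespace before real use.
PLACEHOLDER_TAN_NS = "urn:example:tan-namespace-placeholder"

root = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/><text><body/></text></TEI>'
)
start_conversion(root, "tag:example.com,2015:demo", PLACEHOLDER_TAN_NS)
print([child.tag.rsplit("}", 1)[-1] for child in root])  # prints ['teiHeader', 'head', 'text']
```

Calling the function a second time leaves the tree unchanged apart from resetting @id, so it is safe to rerun on a partly converted file.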

Those who have converted TEI files to TAN-TEI report that step 2 is the most time-consuming. The TAN <head> demands more careful curation of metadata than <teiHeader> does. But step 3 should not be overlooked, either. Many people write TEI files with a focus on the original textual object, and they make editorial decisions that look toward the scriptum and not the intertextual ecosystem that TAN supports. It is advisable to trim from the body of your TEI file any elements that would interfere with direct comparison with other versions of the text in the TAN format.