Transcriptions Using the Text Encoding Initiative (<TEI>)

Note

Read this section in conjunction with Chapter 5, Class-1 TAN Files, Representations of Textual Objects (Scripta) and the section called “The Text Encoding Initiative”, which address technical issues relating to TAN-compliant TEI, to XML, and to validation generally.

Some creators and editors of transcriptions will find the rather stripped-down TAN-T format inadequate. Some may wish to mark up the text further, perhaps identifying quotations or other features. Others may already have a library of transcriptions with detailed annotations that are worth keeping, even if TAN users may not be interested in them. To serve these needs, use TAN-TEI, an extension to the Text Encoding Initiative (TEI) format, which is well known for its expressiveness, its stability, its flexibility, and its widespread use in scholarship.

TEI was designed to be maximally expressive and flexible, to serve the detailed needs of humanities scholars. In serving this mission, TEI has come to define more than five hundred different element names, and more than two hundred attributes (roughly six times more than are defined in TAN). Of course, any given TEI file uses only a small subset of those elements and attributes, and TEI itself comes in different flavors, from TEI Lite, which uses only 75 attributes and 140 elements, to TEI All, which opens up almost the entire library.

Although the TEI format is often seen as a standard, it does not have some of the characteristics most people associate with standards. It is highly flexible, admits flavors and interpretation, and has been designed to encourage customization. Individuals and projects may define their own subset of TEI elements, restricting or expanding the allowable rules as they see fit. TAN-TEI is one of those customizations. The major difference is that TAN-TEI imposes extra strictures not defined in TEI, so that transcriptions are as likely as possible to be interchangeable with other TAN files.

TAN's customization of the TEI can be summarized as follows (the default namespace in this section is the TEI namespace, http://www.tei-c.org/ns/1.0):

Table 5.1. Synopsis of TAN-TEI customization

TEI element: summary of alteration
<TEI>
  • must have @id with IRI name

  • should take new namespace declaration, xmlns:tan="tag:textalign.net,2015:ns"

  • takes a new child element, <head>, placed between <teiHeader> and <text>

<text>
  • Only the child <body> will be regarded by other TAN users. <front> and <back> will be ignored.

<body>
  • must take @xml:lang

  • may take @in-progress

  • must take exclusively one or more <div>s

  • any elements or text between <div>s will be ignored

  • overall contents must be restricted to a single work

  • any and all text nodes will be treated as part of the transcription

<div>
  • must take either only <div>s or no <div>s at all

  • must take @type and @n (@include is not allowed in TAN-TEI, but is allowed in TAN-T)
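
The rules in the synopsis can be checked mechanically. The following is a minimal sketch in Python, not a substitute for the official TAN-TEI validation files. The sample document, its @id, @type, @n, and @xml:lang values, and the function name check_tan_tei are all invented for illustration, and the empty <teiHeader> and <tan:head> would not pass real validation.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TAN_NS = "tag:textalign.net,2015:ns"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
NS = {"tei": TEI_NS}
DIV = f"{{{TEI_NS}}}div"

# Hypothetical skeleton: every attribute value here is invented, and
# <teiHeader> and <tan:head> are left empty for brevity.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:tan="tag:textalign.net,2015:ns"
     id="tag:example.com,2015:sample-transcription">
  <teiHeader/>
  <tan:head/>
  <text>
    <body xml:lang="grc">
      <div type="chapter" n="1">
        <div type="verse" n="1">Ἐν <hi rend="bold">ἀρχῇ</hi> ἦν ὁ λόγος</div>
        <div type="verse" n="2">οὗτος ἦν ἐν ἀρχῇ πρὸς τὸν θεόν</div>
      </div>
    </body>
  </text>
</TEI>"""


def check_tan_tei(root):
    """Return a list of violations of the rules summarized in the synopsis."""
    problems = []
    if root.tag != f"{{{TEI_NS}}}TEI":
        problems.append("root element is not <TEI>")
    if not root.get("id"):
        problems.append("<TEI> lacks @id")

    # <head> (TAN namespace) must sit between <teiHeader> and <text>.
    tags = [child.tag for child in root]
    wanted = [f"{{{TEI_NS}}}teiHeader", f"{{{TAN_NS}}}head", f"{{{TEI_NS}}}text"]
    if not all(tag in tags for tag in wanted):
        problems.append("<TEI> must contain <teiHeader>, TAN <head>, and <text>")
    else:
        order = [tags.index(tag) for tag in wanted]
        if order != sorted(order):
            problems.append("TAN <head> is not between <teiHeader> and <text>")

    body = root.find("tei:text/tei:body", NS)
    if body is None:
        problems.append("<text> has no <body>")
        return problems
    if body.get(XML_LANG) is None:
        problems.append("<body> lacks @xml:lang")
    if len(body) == 0 or any(child.tag != DIV for child in body):
        problems.append("<body> must contain one or more <div>s and nothing else")

    for div in body.iter(DIV):
        if div.get("type") is None or div.get("n") is None:
            problems.append("a <div> lacks @type or @n")
        children = list(div)
        div_children = [child for child in children if child.tag == DIV]
        # Mixed content (some <div>s, some other elements) is not allowed.
        if 0 < len(div_children) < len(children):
            problems.append("a <div> mixes <div>s with other elements")
    return problems


print(check_tan_tei(ET.fromstring(SAMPLE)))  # prints []
```

The check treats a <div> as a leaf when it has no <div> children, which is why the <hi> inside the first verse raises no complaint.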


Like all other TAN files, the root elements of TAN-TEI files must take an @id, the IRI name. See above, the section called “Tag URNs”.

TAN-TEI files have two heads. It would be convenient if there were a clear, automatic mapping between the TEI head and the TAN head, but each one was designed to serve a distinct purpose. Whereas the TAN <head> is designed to be brief and optimized for both humans and computers, the <teiHeader> has been designed principally for human readability, and permits an expansive range of metadata. Processors of TAN-TEI files will in general ignore the contents of <teiHeader>, since those contents are unpredictable. If your <teiHeader> has any metadata relevant to TAN users, you will need to adapt it for the standard TAN <head> (see the section called “Metadata (<head>)” and the section called “Principles and Assumptions”). You may find that some of the material you put in <teiHeader> is not suitable for <head> and vice versa. This conversion must be done by hand, since each header calls for a different way of thinking.

In a TAN-TEI file, the TAN <head> must be in the TAN namespace. Either declare that namespace as the element's default, i.e., <head xmlns="tag:textalign.net,2015:ns">, or write <tan:head> if the prefix tan: has been defined in the root element.
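
To a namespace-aware parser the two spellings are the same element. The sketch below, which assumes Python's standard xml.etree.ElementTree and uses invented @id values and otherwise empty placeholder elements, parses both forms and compares the qualified name of the second child of <TEI>.

```python
import xml.etree.ElementTree as ET

# Form 1: <head> redeclares the default namespace for itself and its descendants.
FORM_DEFAULT = """<TEI xmlns="http://www.tei-c.org/ns/1.0"
     id="tag:example.com,2015:demo">
  <teiHeader/>
  <head xmlns="tag:textalign.net,2015:ns"/>
  <text/>
</TEI>"""

# Form 2: the tan: prefix is bound once, on the root element.
FORM_PREFIXED = """<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:tan="tag:textalign.net,2015:ns"
     id="tag:example.com,2015:demo">
  <teiHeader/>
  <tan:head/>
  <text/>
</TEI>"""


def head_tag(xml_text):
    """Return the namespace-qualified name of the element after <teiHeader>."""
    root = ET.fromstring(xml_text)
    return root[1].tag


print(head_tag(FORM_DEFAULT))   # prints {tag:textalign.net,2015:ns}head
print(head_tag(FORM_DEFAULT) == head_tag(FORM_PREFIXED))  # prints True
```

The practical difference is in the children of <head>: with the first form they inherit the TAN namespace without a prefix, whereas with the second form each one needs the tan: prefix.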

Within any leaf <div>, you may use whatever TEI markup you wish, to whatever level of depth or complexity. All users of your TAN-TEI file will be interested in the text; only a subset will care about any markup within leaf <div>s. For this reason, even if you change the value of @xml:lang within a leaf <div>, there is no guarantee that readers or processors of your data will take it into account. Likewise, if you try to represent the physical appearance of the text on the object, that markup is likely to be ignored. TAN rules on normalizing space and Unicode characters also prevail over any exemptions declared in TEI.
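
A rough picture of what a text-only consumer sees in a leaf <div> can be had by discarding the markup and normalizing what remains. The sketch below is an approximation under stated assumptions: it uses Unicode NFC and collapses runs of white space to single spaces, which may not match TAN's actual normalization rules (those are defined elsewhere in these guidelines). The sample <div> and the function name plain_text are invented.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical leaf <div>: irregular spacing, inline markup, a local
# @xml:lang change, and a decomposed accent (o + U+0301) in "logos".
LEAF = ET.fromstring(
    '<div xmlns="http://www.tei-c.org/ns/1.0" type="verse" n="1">'
    'In  <hi rend="bold">principio</hi>\n   erat '
    '<foreign xml:lang="grc">lo\u0301gos</foreign> </div>'
)


def plain_text(div):
    """Concatenate all text nodes, ignoring element boundaries and attributes."""
    raw = "".join(div.itertext())
    # Assumption: NFC composition stands in for TAN's Unicode normalization.
    composed = unicodedata.normalize("NFC", raw)
    # Assumption: any run of white space becomes one space; ends are trimmed.
    return " ".join(composed.split())


print(plain_text(LEAF))
```

Note that the <hi> and <foreign> elements, and the @xml:lang on the latter, leave no trace in the result.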

Most often, you will need to take an existing TEI file and prepare it for the TAN-TEI format. As a matter of practicality, it is helpful to envision the conversion process as falling into three steps:

  1. Structure: insert new processing instructions that point to the TAN-TEI validation files; adjust the root element by supplying the IRI name in @id and the TAN namespace in @xmlns:tan.

  2. Metadata: create a new <head> and populate it.

  3. Data: edit <body> to restrict the content to a single work; restructure the <body> content into nesting <div>s with correct @type and @n values.
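
The mechanical parts of steps 1 and 2 can be scripted; the curation cannot. The following sketch, assuming Python's standard xml.etree.ElementTree, sets @id on the root and inserts an empty TAN <head> after <teiHeader>. The function name scaffold_tan_tei and the IRI in the usage example are hypothetical. The sketch deliberately omits the validation processing instructions, because their targets are not given in this section, and it does nothing for step 3. ElementTree also discards comments and processing instructions when parsing, so a real conversion may be better served by XSLT or another tool that preserves them.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TAN_NS = "tag:textalign.net,2015:ns"

# Serialize TEI as the default namespace and TAN under the tan: prefix.
# (register_namespace changes global state for this Python process.)
ET.register_namespace("", TEI_NS)
ET.register_namespace("tan", TAN_NS)


def scaffold_tan_tei(tei_xml, iri_name):
    """Return tei_xml with @id set and an empty <tan:head> after <teiHeader>."""
    root = ET.fromstring(tei_xml)
    root.set("id", iri_name)  # step 1: IRI name on the root element

    header = root.find(f"{{{TEI_NS}}}teiHeader")
    if header is None:
        raise ValueError("input has no <teiHeader> child of the root element")

    # Step 2 (skeleton only): the <head> still has to be populated by hand.
    if root.find(f"{{{TAN_NS}}}head") is None:
        position = list(root).index(header) + 1
        root.insert(position, ET.Element(f"{{{TAN_NS}}}head"))

    return ET.tostring(root, encoding="unicode")


# Usage with a hypothetical, heavily abbreviated TEI file.
PLAIN_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader/>
  <text><body><p>Some text.</p></body></text>
</TEI>"""

print(scaffold_tan_tei(PLAIN_TEI, "tag:example.com,2015:demo"))
```

Running the function twice on its own output leaves a single <head>, so it is safe to apply to files that are already partly converted.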

It has been the experience of those who have made TEI-to-TAN-TEI conversions that step 2 is the most time-consuming. The TAN <head> requires more careful curation of metadata than <teiHeader> does. But step 3 should not be overlooked, either. Many people write TEI files with a focus on the original textual object, and they make editorial decisions that enhance comparison of the transcription with digital facsimiles of the source. Users of TAN files, however, are less interested in the original Sitz im Leben of a particular text than in seeing how a transcription fits into the ocean of intertextuality and relates to other versions of the same work. Therefore, it is advisable to trim from the body of your TEI file any elements that would interfere with direct comparison with other versions of the text in the TAN format.