Transcriptions Using the Text Encoding Initiative (<TEI>)


This section is to be read in conjunction with Chapter 5, Class-1 TAN Files, Representations of Textual Objects (Scripta) and the section called “The Text Encoding Initiative”, which address related technical issues.

Some creators and editors of transcriptions will find the rather stripped-down TAN-T format inadequate. Some may wish to mark up the text further. Others may already have a library of transcriptions whose annotations are worth keeping, even if they are uninteresting to most users. In these cases, you should use TAN-TEI, an extension to the Text Encoding Initiative (TEI) format, which is well known for its expressiveness, its stability, its flexibility, and its widespread use in scholarship.

TEI was designed to be maximally expressive and flexible, to serve the detailed needs of humanities scholars. In serving this mission, TEI has come to define more than five hundred different element names, and more than two hundred attributes (roughly six times more than are defined in TAN). Of course, any given TEI file uses only a small subset of those elements and attributes, and TEI itself comes in different flavors, from TEI Lite, which uses only 75 attributes and 140 elements, to TEI All, which opens up almost the entire library.

Although the TEI format is often seen as a standard, it lacks some of the characteristics one normally expects in a standard. It is very flexible, admits flavors and interpretation, and has been designed to encourage customization. Individuals and projects may define their own subset of TEI elements, to constrict or expand the allowable rules as they see fit. TAN-TEI is one of those customizations. The major difference is that TAN-TEI imposes extra strictures not defined in TEI, so that any given transcription is as likely as possible to be interchangeable with other TAN-TEI files.

TAN's customization of the TEI can be summarized as follows (the default namespace in this section is the TEI namespace):

Table 5.1. Synopsis of TAN-TEI customization

TEI element: summary of alteration

<TEI>

  • must have @id with IRI name

  • should take new namespace declaration, xmlns:tan=",2015:ns"

  • takes a new child element, <head>, placed between <teiHeader> and <text>

<text>

  • Only the child <body> will be considered. <front> and <back> will be ignored.

<body>

  • must take @xml:lang

  • may take @in-progress

  • must take exclusively one or more <div>s

  • any elements or text between <div>s will be ignored

  • contents must be restricted to a single work

<div>

  • any and all text nodes will be treated as part of the transcription
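The synopsis can be turned into a rough structural check. What follows is only a sketch, not the official TAN-TEI validation: it assumes the TAN namespace is tag:textalign.net,2015:ns (the text above shows only its tail), and the sample file, its @id value, and the function name check_tan_tei are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TAN_NS = "tag:textalign.net,2015:ns"  # assumption: only ",2015:ns" appears in the text
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def q(ns, name):
    """Build the '{namespace}name' form that ElementTree uses for tags."""
    return f"{{{ns}}}{name}"


# Hypothetical minimal TAN-TEI file, used only to exercise the checks.
SAMPLE = f"""<TEI xmlns="{TEI_NS}" xmlns:tan="{TAN_NS}" id="tag:example.com,2015:demo">
  <teiHeader/>
  <tan:head/>
  <text>
    <body xml:lang="eng">
      <div type="chapter" n="1">In the beginning</div>
    </body>
  </text>
</TEI>"""


def check_tan_tei(xml_text):
    """Return a list of problems found against the synopsis (empty if none)."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != q(TEI_NS, "TEI"):
        problems.append("root element is not TEI")
    if not root.get("id"):
        problems.append("root lacks @id")

    # The TAN head must sit between teiHeader and text.
    tags = [child.tag for child in root]
    needed = [q(TEI_NS, "teiHeader"), q(TAN_NS, "head"), q(TEI_NS, "text")]
    if all(tag in tags for tag in needed):
        positions = [tags.index(tag) for tag in needed]
        if positions != sorted(positions):
            problems.append("tan:head is not between teiHeader and text")
    else:
        problems.append("root needs teiHeader, tan:head and text children")

    # Only text/body is considered; front and back are ignored.
    text = root.find(q(TEI_NS, "text"))
    body = text.find(q(TEI_NS, "body")) if text is not None else None
    if body is None:
        problems.append("no text/body found")
        return problems
    if XML_LANG not in body.attrib:
        problems.append("body lacks @xml:lang")
    if len(body) == 0 or any(c.tag != q(TEI_NS, "div") for c in body):
        problems.append("body must contain one or more div children and nothing else")
    return problems


print(check_tan_tei(SAMPLE))
```

The check reports non-div children of <body> as problems, which is stricter than a processor that silently ignores them; it also does not test whether the transcription is restricted to a single work, since that is an editorial judgment.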


Like all other TAN files, the root elements of TAN-TEI files must take an @id, the IRI name. See above, the section called “Tag URNs”.

TAN-TEI files have two heads, which may strike you as odd. The TEI head and the TAN head were designed for different purposes. Whereas the TAN <head> is meant to be brief and keyed to both IRIs and human-readable data, the <teiHeader> permits quite an expansive range of metadata, and about matters that bear only indirectly on the transcription (e.g., manuscript descriptions). Further, <teiHeader> was designed to be read principally by humans.

Processors of TAN-TEI files will in general ignore the contents of <teiHeader>, since the contents are unpredictable. If your <teiHeader> has any kind of metadata relevant to TAN users, you will need first to create a standard TAN <head> (see the section called “Metadata (<head>)” and the section called “Principles and Assumptions”). This conversion needs to be performed manually, since the two headers are incommensurate, and writing each one requires a different kind of mentality.

In a TAN-TEI file, the TAN <head> must take the TAN namespace, i.e., <head xmlns=",2015:ns"> or <tan:head> if the prefix tan: has been defined in the root element.
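The two spellings of the TAN <head> are equivalent to an XML parser, and the namespace is what distinguishes it from TEI's own <head> element (used for headings). The sketch below illustrates this with Python's standard library. It assumes the TAN namespace is tag:textalign.net,2015:ns, since only its tail appears above; the three tiny documents and the function name find_tan_head are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TAN_NS = "tag:textalign.net,2015:ns"  # assumption: only ",2015:ns" appears in the text

# Form 1: the head redeclares the default namespace.
WITH_DEFAULT = f'<TEI xmlns="{TEI_NS}"><head xmlns="{TAN_NS}"/></TEI>'
# Form 2: the prefix tan: is declared in the root element.
WITH_PREFIX = f'<TEI xmlns="{TEI_NS}" xmlns:tan="{TAN_NS}"><tan:head/></TEI>'
# A head left in the TEI namespace is a TEI heading, not a TAN head.
TEI_HEAD_ONLY = f'<TEI xmlns="{TEI_NS}"><head/></TEI>'


def find_tan_head(xml_text):
    """Return the TAN-namespace head child of the root, or None."""
    root = ET.fromstring(xml_text)
    return root.find(f"{{{TAN_NS}}}head")


for doc in (WITH_DEFAULT, WITH_PREFIX, TEI_HEAD_ONLY):
    head = find_tan_head(doc)
    print("found" if head is not None else "missing")
```

Both of the first two documents yield an element with the same namespace-qualified name, so the choice between them is a matter of convenience.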

Within any leaf <div>, you may use whatever TEI markup you wish, to whatever level of depth or complexity. All users of your TAN-TEI file will be interested in the text; only a subset will care about any markup within leaf <div>s. For this reason, even if you change the value of @xml:lang within a leaf <div>, there is no guarantee that readers or processors of your data will take it into account.
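To see what a text-only consumer of your file might do, consider the following sketch, which keeps the text of each leaf <div> and discards the inner markup. It is an illustration under assumptions, not a description of any actual TAN processor: a leaf <div> is taken to be one with no child <div>, whitespace is normalized (a choice made here for readability), and the sample markup and the function name leaf_div_texts are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
DIV = f"{{{TEI_NS}}}div"

# Hypothetical body: one chapter holding one verse with inline TEI markup.
SAMPLE_BODY = f"""<body xmlns="{TEI_NS}" xml:lang="eng">
  <div type="chapter" n="1">
    <div type="verse" n="1">In the <hi rend="italic">beginning</hi>
      <foreign xml:lang="lat">verbum</foreign></div>
  </div>
</body>"""


def leaf_div_texts(xml_text):
    """Return (n, text) pairs for every div that has no child div."""
    root = ET.fromstring(xml_text)
    results = []
    for div in root.iter(DIV):
        if div.find(DIV) is not None:
            continue  # not a leaf: it wraps other divs
        # itertext() walks every text node, whatever markup surrounds it.
        raw = "".join(div.itertext())
        results.append((div.get("n"), " ".join(raw.split())))
    return results


print(leaf_div_texts(SAMPLE_BODY))
```

Note that the <hi> and <foreign> elements, including the changed @xml:lang, leave no trace in the result, which is exactly the risk described above.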

TAN-TEI should not be used to try to represent the physical appearance of the text on the object.

You may need to prepare a TEI file to be TAN compliant. As a matter of practicality, it is helpful to envision the conversion process as falling into three steps:

  1. Structure: insert new processing instructions (TAN-TEI validation files); adjust root element by supplying IRI name to @id, TAN namespace to @xmlns:tan.

  2. Metadata: create new <head> and populate it

  3. Data: edit <body> to restrict the content to a single work; restructure <body> content into nesting <div>s with correct @type and @n values.
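The mechanical part of steps 1 and 2 can be sketched as follows. This is a minimal illustration under assumptions: the TAN namespace is taken to be tag:textalign.net,2015:ns (only its tail appears in this section), the IRI and the function name add_tan_scaffolding are hypothetical, and the processing instructions for the validation files are left out because their locations are not given here. Populating the <head> and restructuring the <body> remain manual work.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TAN_NS = "tag:textalign.net,2015:ns"  # assumption: only ",2015:ns" appears in the text


def add_tan_scaffolding(xml_text, iri):
    """Give the root an @id and insert an empty TAN head just before <text>."""
    # Ask the serializer for readable prefixes (this setting is module-wide).
    ET.register_namespace("", TEI_NS)
    ET.register_namespace("tan", TAN_NS)

    root = ET.fromstring(xml_text)
    root.set("id", iri)

    # TEI puts teiHeader first, so inserting immediately before <text>
    # places the new head between teiHeader and text.
    tags = [child.tag for child in root]
    text_index = tags.index(f"{{{TEI_NS}}}text")
    root.insert(text_index, ET.Element(f"{{{TAN_NS}}}head"))
    return ET.tostring(root, encoding="unicode")


plain_tei = f'<TEI xmlns="{TEI_NS}"><teiHeader/><text><body/></text></TEI>'
print(add_tan_scaffolding(plain_tei, "tag:example.com,2015:demo"))
```

Because the input is parsed from a string, any processing instructions or comments before the root element are not carried over; a real conversion would need to preserve or rewrite them.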

It has been the experience of those who have made TEI to TAN-TEI conversions that step 2 is the most time-consuming. The TAN <head> requires more careful curation of metadata than <teiHeader> does. But step 3 should not be underestimated, either. Many people write TEI files with a focus on the original textual object, and they do not normalize to the level expected in a TAN file. In general, the simpler the TEI file, the better.