Note: This section is to be read in conjunction with Chapter 5, Class-1 TAN Files, Representations of Textual Objects (Scripta) and the section called “The Text Encoding Initiative”, which address related technical issues.
Some creators and editors of transcriptions will find the rather stripped-down TAN-T format inadequate. Some may wish to mark up the text further. Some may already have a library of transcriptions whose annotations are worth keeping, even if they do not interest every user. In these cases, you should use TAN-TEI, a customization of the Text Encoding Initiative (TEI) format, which is well known for its expressiveness, its stability, its flexibility, and its widespread use in textual scholarship.
TEI was designed to be maximally expressive and flexible, to serve the detailed needs of scholars in the humanities. In serving this mission, TEI has come to define more than five hundred different elements, and more than two hundred attributes (roughly six times more than are defined in TAN). Of course, any given TEI file uses only a small subset of those elements and attributes, and TEI itself comes in different flavors, from TEI Lite, which uses only 75 attributes and 140 elements, to TEI All, which opens up almost the entire library.
Although TEI XML is often described as a standard, it lacks characteristics one normally expects of a standard. It is very flexible, admits flavors and interpretation, and is best used when it is customized. Individuals and projects may define their own subset of TEI elements, to constrict or expand the allowable rules as they see fit. TAN-TEI is one of those customizations, based on TEI All. The major difference between TEI All and TAN-TEI is that the latter imposes extra strictures, to ensure that transcriptions are maximally likely to be interchangeable with other TAN-TEI files.
All TEI files are validated against a TEI-conformant schema, normally expressed as an XML DTD, RELAX NG, or W3C Schema.
TAN's TEI-conformant schema is based upon the TAN-TEI.odd file in the schemas directory, converted to RELAX NG files, TEI.rnc and TEI.rng, which define the structural rules of TAN-TEI files. There is an additional layer of validation, through the related Schematron process (TEI.sch), which performs detailed validation not possible in a TEI-conformant schema. In the discussion below, it is important to distinguish between structural validation and Schematron validation. See the section called “The TAN Validation Process”.
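As a minimal sketch of how the two layers of validation might be attached to a file, the standard xml-model processing instruction can point to each schema. The relative paths below are placeholders, not the actual locations in any TAN release, and the root element is left empty for brevity.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Structural validation: RELAX NG, compact syntax (placeholder path) -->
<?xml-model href="schemas/TEI.rnc" type="application/relax-ng-compact-syntax"?>
<!-- Schematron validation: the additional layer (placeholder path) -->
<?xml-model href="schemas/TEI.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- file content omitted -->
</TEI>
```

A file can pass the first check and still fail the second, which is why the two are kept distinct.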
TAN's customization of the TEI can be summarized as follows (the default namespace in this section is the TEI namespace, http://www.tei-c.org/ns/1.0):
Table 5.3. Synopsis of TAN-TEI customization
TEI element | Strictures |
---|---|
<TEI> |
|
<text> |
|
<body> | |
<div> |
|
TAN-TEI files have two heads, which may strike you as strange. Each does something different and was designed for a different purpose. Whereas the TAN <head> is meant to be brief and restricted to only those matters relevant to the transcription, the <teiHeader> permits quite an expansive range of metadata, and may be used to encode a variety of things, including those that are tangential or irrelevant to the data. Unlike the TAN <head>, whose data is designed to be both computer- and human-readable, <teiHeader> was designed for data to be read principally by humans; although it can accommodate IRIs, it was not designed around them. Further, a TAN <head> can never be both empty and valid, whereas a bare-bones <teiHeader> with no actual text content, such as the following, is considered valid:
    <teiHeader>
      <fileDesc>
        <titleStmt><title/></titleStmt>
        <publicationStmt><p/></publicationStmt>
        <sourceDesc><p/></sourceDesc>
      </fileDesc>
    </teiHeader>
TAN's Schematron validation process ignores the contents of <teiHeader>, since its contents are unpredictable and therefore not reliably parsable. If your <teiHeader> has any kind of metadata that needs to appear in the TAN <head> (see the section called “Metadata (<head>)” and the section called “Principles and Assumptions”), the conversion needs to be performed manually, since (as mentioned above) the two headers are incommensurate, and writing each one requires a different mentality.
In a TAN-TEI file, the TAN <head> must be in the TAN namespace, i.e., <head xmlns="tag:textalign.net,2015:ns"> (or <tan:head xmlns:tan="tag:textalign.net,2015:ns">, but this would require all descendant elements to be prefixed tan:).
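The two forms can be compared side by side. The fragments below are alternatives, not a single document, and the child element and its content are illustrative only; what a valid TAN head must actually contain is covered in the section on metadata.

```xml
<!-- Form 1: default namespace declared on the head; descendants need no prefix -->
<head xmlns="tag:textalign.net,2015:ns">
  <name>Sample title</name>
</head>

<!-- Form 2: prefixed head; every descendant must carry the tan: prefix too -->
<tan:head xmlns:tan="tag:textalign.net,2015:ns">
  <tan:name>Sample title</tan:name>
</tan:head>
```

Both forms place the same elements in the same namespace, so the choice is one of convenience. The first is usually less verbose inside a file whose default namespace is otherwise TEI.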
Within any leaf <div>, you may use whatever TEI markup you wish, to whatever level of depth or complexity. Most users of your TAN-TEI file will be interested in the text; only a subset will care about any markup within leaf <div>s.
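A small sketch of what this can look like follows. The @type and @n values, the wording, and the choice of inline elements are all hypothetical; the point is only that the inner <div>, which contains no further <div>, is the leaf, and that TEI markup of any kind sits inside it.

```xml
<div type="chapter" n="1">
  <!-- Leaf div: no div children, so TEI markup is unrestricted from here down -->
  <div type="section" n="1">
    <p>The <hi rend="italic">first</hi> sentence mentions <persName>Socrates</persName> by name.</p>
  </div>
</div>
```

A reader interested only in the text sees one sentence; the <hi> and <persName> elements matter only to those who care about the extra markup.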
TEI files are flexible, permitting different approaches to markup. A TAN-TEI file should not be scriptum-oriented, i.e., it should not try to replicate how the text appears or looks on the object.
You may have a TEI file that you wish to convert to TAN-TEI. As a matter of practicality, it is helpful to envision the conversion process as falling into three steps:
1. Structure: insert new processing instructions (pointing to files to perform TAN-TEI structural and Schematron validation); adjust root element by supplying a tag URN for @id and @TAN-version.
2. Metadata: create new <head xmlns="tag:textalign.net,2015:ns"> and populate it.
3. Data: edit <body> to make sure all text nodes are restricted to the content of a single version of a single work; restructure <body> content into nesting <div>s with correct @type and @n values.
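The outline below sketches where each of the three steps leaves its mark on a converted file. It is an illustration under several assumptions, not a valid TAN-TEI file: the tag URN, the @TAN-version value, and the @type and @n values are placeholders; the position of the TAN head between <teiHeader> and <text> is assumed; and the TAN head is left unpopulated.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Step 1 (structure): the processing instructions for structural and
     Schematron validation go here, before the root element. -->
<!-- Step 1 (structure): @id and @TAN-version on the root; placeholder values. -->
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     id="tag:example.org,2015:sample-text"
     TAN-version="PLACEHOLDER">
  <teiHeader>
    <!-- existing TEI header -->
  </teiHeader>
  <!-- Step 2 (metadata): new TAN head in the TAN namespace, written by hand -->
  <head xmlns="tag:textalign.net,2015:ns">
    <!-- TAN metadata goes here -->
  </head>
  <text>
    <body>
      <!-- Step 3 (data): one version of one work, in nested divs -->
      <div type="chapter" n="1">
        <div type="section" n="1">
          <p>Text of the first section.</p>
        </div>
      </div>
    </body>
  </text>
</TEI>
```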
It has been the experience of those who have made TEI-to-TAN-TEI conversions that step 2 is the most time-consuming, particularly in finding suitable IRIs. But step 3 should not be underestimated, either. Many people write TEI files with a focus on the original textual object, and they do not normalize to the level expected in a TAN file. Some TEI files have been written with little attention paid to space and space normalization. Some TEI files are so laden with annotations that the text is impossible to read. In general, the simpler the TEI file, the better, with annotations pushed to external files.
Some TEI markup is already implicit, or is easily calculable (e.g., <w> to mark words, which should already comport with the tokenization declared in the <head>; users of <w> easily lose track of where space is and isn't). Some TEI markup can be expressed in a class-2 file (e.g., lexico-morphological data, which should be expressed in a TAN-A-lm file).
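The point about space can be seen in a pair of hypothetical fragments that differ only in the text between the <w> elements. The wording is invented for illustration.

```xml
<!-- No text nodes between the <w> elements: the words run together as text -->
<p><w>in</w><w>principio</w><w>erat</w></p>

<!-- Space kept as text between the <w> elements: the words stay separate -->
<p><w>in</w> <w>principio</w> <w>erat</w></p>
```

Read as plain text, the first paragraph yields "inprincipioerat" and the second "in principio erat". Leaving word boundaries to the declared tokenization avoids the discrepancy altogether.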