TAN Tutorial 1

Preparing a TEI Corpus for the Text Alignment Network

Session 2

Transcriptions

Objectives

Adjust TEI files to be TAN-compliant.
Coordinate transcriptions with vocabulary.
Understand how TAN handles space, Unicode, word division, and other types of normalization.
Learn how to handle reference systems.

Transcriptions

Varieties of TEI

Sample	Gaze	Comments
Averroes, Commentum super libro Peryermenias, transl. William of Luna	Mixed	Many invalid structures; inconsistencies
Biblioteca Apostolica Vaticana, Vat. Lat. 3827, fol. 74v-75r	Scriptum	More metadata than transcription
Decree of Cyrene for the citizenship of the Thereans, Therean 'decree' and founders' oath	Scriptum	Contains bibliography, text, translations, apparatus, commentary
Kungliga biblioteket, A 13	Scriptum	Only manuscript description; no body text
Shakespeare, The Phoenix and the Turtle	Intertext	See inline comments
Plato, Euthyphro, Burnet translation	Intertext	See inline comments

Suitability for TAN

Intertext-orientation
Normalization
Individual files restricted to:
- one work
- one version
- one scriptum
- one reference system

Not all TEI files are appropriate for TAN, and that's OK.

TAN Transcriptions

Class 1 files

First of three classes: transcriptions
Format: TAN-T or TAN-TEI
Metadata focused on transcription
TEI: based on TEI All
TAN-T: "plain text" TEI

The transcription

One work, one version, one scriptum, one reference system
Competing reference systems? Use TEI anchors or TAN <redivision>
Core structure: <div>s
@type tethered to URIs
@n permits synonyms

TAN-ifying TEI

`<TEI>`

must have @id with tag URN
must have @TAN-version
takes <head>, placed between <teiHeader> and <text>, in the TAN namespace, tag:textalign.net,2015:ns; this head is rooted in RDF semantics and focuses exclusively on the transcription
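
The three requirements above can be spot-checked with a short script. This is a minimal sketch, not TAN's own validation: the function name `check_tei_root`, the sample `@id`, and the `TAN-version` value are placeholders invented for illustration, and the empty `<teiHeader>` is there only to keep the sample short.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TAN_NS = "tag:textalign.net,2015:ns"  # TAN namespace, as given on the slide

# Hypothetical skeleton: the id and TAN-version values are placeholders.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"
     id="tag:example.org,2024:demo-transcription" TAN-version="2021">
  <teiHeader/>
  <head xmlns="tag:textalign.net,2015:ns"/>
  <text><body xml:lang="eng"/></text>
</TEI>"""


def check_tei_root(xml_text):
    """Return a list of problems found on the root element (empty if none)."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != f"{{{TEI_NS}}}TEI":
        problems.append("root element is not TEI")
    if not root.get("id", "").startswith("tag:"):
        problems.append("@id is missing or is not a tag URN")
    if not root.get("TAN-version"):
        problems.append("@TAN-version is missing")
    # The TAN <head> must sit between <teiHeader> and <text>.
    children = [child.tag for child in root]
    expected = [f"{{{TEI_NS}}}teiHeader", f"{{{TAN_NS}}}head", f"{{{TEI_NS}}}text"]
    if children != expected:
        problems.append("expected teiHeader, TAN head, text in that order")
    return problems


print(check_tei_root(SAMPLE))
```

A check like this only catches missing pieces early; the TAN schemas remain the authority on what is valid.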

`<text>`

No extra strictures, but during Schematron validation (not RELAX NG validation), this element and any child <front> or <back> will be ignored. Of its children, only <body> will be validated by Schematron.

`<body>`

must take @xml:lang
any non-<div> children will be ignored during Schematron validation
contents must be restricted to a single version of a single work
all descendant text nodes will be treated as the transcription

`<div>`

may encompass a textual division of whatever size you like (like HTML)
child elements are either (1) only <div>s and anchors such as <pb> or (2) only non-<div>s
must take @type and @n (or only @include)
@type may take multiple values, space delimited
@n must match: [\w\._]+([\- ,]+[\w\._]+)*. Allows synonyms, sequences, ranges.
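
The `<div>` rules above can be sketched as a recursive check. This is an illustration under stated assumptions, not TAN's Schematron: `check_div` and the `ANCHORS` set are hypothetical, the sample omits the TEI namespace for brevity, and Python's `\w` may not match exactly the same characters as the schema's regular expression dialect.

```python
import re
import xml.etree.ElementTree as ET

# Pattern for @n copied from the slide.
N_PATTERN = re.compile(r"[\w\._]+([\- ,]+[\w\._]+)*")
# Assumption: elements treated as anchors that may sit beside <div>s.
ANCHORS = {"pb", "lb", "cb", "milestone"}

SAMPLE = """<div type="book" n="1">
  <div type="chapter" n="1"><p>First chapter.</p></div>
  <div type="chapter" n="2-"><p>Second chapter.</p></div>
  <div type="chapter"><div type="section" n="a"/><p>Stray paragraph.</p></div>
</div>"""


def check_div(div, path="div"):
    """Return a list of rule violations for this <div> and its sub-<div>s."""
    problems = []
    if div.get("include") is None:  # @include alone replaces @type and @n
        if not div.get("type"):
            problems.append(f"{path}: missing @type")
        n = div.get("n")
        if n is None:
            problems.append(f"{path}: missing @n")
        elif not N_PATTERN.fullmatch(n):
            problems.append(f"{path}: @n value {n!r} does not match the pattern")
    children = list(div)
    sub_divs = [c for c in children if c.tag == "div"]
    others = [c for c in children if c.tag != "div" and c.tag not in ANCHORS]
    if sub_divs and others:
        problems.append(f"{path}: mixes <div> children with other elements")
    for i, child in enumerate(sub_divs, start=1):
        problems.extend(check_div(child, f"{path}/div[{i}]"))
    return problems


for problem in check_div(ET.fromstring(SAMPLE)):
    print(problem)
```

The sample deliberately breaks three rules: a trailing hyphen in `@n`, a missing `@n`, and a `<div>` that mixes a sub-`<div>` with a `<p>`.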

Demonstration

Starting TEI file

TAN-TEI file (except head)

Exercise 1

Adjusting the TEI body

Demonstration

TAN-TEI file from previous exercise

Finished TAN-TEI file

Exercise 2

Adjust the TAN-TEI head

Normalization

TAN is intertext-oriented, so normalization should be applied liberally

Space
Punctuation
Ligatures
Case
Orthography
Line breaks

Normalization Enforcement

Unicode NFC
Space normalization: every <div> has an assumed single space unless it ends in a discretionary hyphen or no-break character.
Values of @n (and full references) should not appear in the text itself. (Use @rend if need be.)
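
The first two rules can be sketched in a few lines. This is a rough approximation, not TAN's implementation: `join_divs` is a hypothetical helper, and the sketch assumes that the "discretionary hyphen" is the soft hyphen (U+00AD) and the "no-break character" is the zero width joiner (U+200D).

```python
import re
import unicodedata

SOFT_HYPHEN = "\u00ad"        # assumed to be the discretionary hyphen
ZERO_WIDTH_JOINER = "\u200d"  # assumed to be the no-break character


def join_divs(div_texts):
    """Concatenate the text of consecutive leaf divs under the rules above."""
    parts = []
    for raw in div_texts:
        text = unicodedata.normalize("NFC", raw)    # Unicode NFC
        text = re.sub(r"\s+", " ", text).strip()    # collapse runs of space
        if text.endswith((SOFT_HYPHEN, ZERO_WIDTH_JOINER)):
            parts.append(text[:-1])                 # no space: the word continues
        else:
            parts.append(text + " ")                # the assumed single space
    return "".join(parts).strip()


print(join_divs(["inter\u00ad", "text  align"]))
```

A word broken across two divs is rejoined when the first div ends in one of the special characters; otherwise the two divs are separated by exactly one space.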

Exercise 3

Exploring TAN normalization

Reference Systems

Reference systems

All div types can be classified as material or logical.
Reference systems are always grounded in a particular scriptum, not necessarily the source.
Multiple reference systems? Create separate files, linked via <redivision>
Syntax
- hyphen: range || 1-3 = division 1 through division 3 (range)
- comma: "and" (parataxis) || 1, 3 = division 1, then division 3 (sequence not hierarchy)
- period, colon, etc.: levels of hierarchy (hypotaxis) || 1.3 = division 1 subdivision 3 (hierarchy not sequence)
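
The three kinds of punctuation can be illustrated with a small parser. This is a simplified sketch: `parse_reference` is a hypothetical function, it treats only the period and colon as hierarchy markers, and it assumes each comma-separated item holds at most one hyphen.

```python
import re


def parse_reference(ref):
    """Split a reference into (start, end) pairs of hierarchical paths."""
    items = []
    for chunk in ref.split(","):                      # comma: sequence
        ends = [e.strip() for e in chunk.split("-")]  # hyphen: range
        # period or colon: levels of hierarchy
        paths = [tuple(re.split(r"[.:]", e)) for e in ends]
        items.append((paths[0], paths[-1]))           # a single item is its own range
    return items


print(parse_reference("1-3"))
print(parse_reference("1, 3"))
print(parse_reference("1.3"))
```

Each item comes back as a start path and an end path, so a single division such as 1.3 is a range whose two ends are the same.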

`@n` basics

Strings, numerals, or both
Not case sensitive (ABC = abc)
Numeration systems supported: Arabic, Roman, letter, Arabic + letter, letter + Arabic
Ambiguity of Roman vs. letter resolved with <numerals priority=""> (e.g., does "c" mean 3 or 100?)
Simple numeration, e.g., n="1" n="i" n="a" n="1a"
Complex numeration, e.g., n="1-4" n="2a-2d, 2f"
Non-numeration, e.g., n="matt"
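
The Roman-versus-letter ambiguity can be made concrete with a short sketch. It is not TAN's algorithm: `interpret_n` and its `priority` values are hypothetical, letter numerals are limited to a single letter (a = 1 through z = 26), and combined forms such as "1a" are returned as plain strings.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
ROMAN_PATTERN = re.compile(
    r"m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})"
)


def roman_to_int(text):
    """Return the value of a lowercase Roman numeral, or None if it is not one."""
    if not text or not ROMAN_PATTERN.fullmatch(text):
        return None
    total = 0
    for ch, nxt in zip(text, text[1:] + " "):
        value = ROMAN_VALUES[ch]
        # Subtract when a smaller numeral precedes a larger one (e.g. iv).
        total += -value if ROMAN_VALUES.get(nxt, 0) > value else value
    return total


def letter_to_int(text):
    """Return 1-26 for a single letter a-z, or None otherwise."""
    if len(text) == 1 and "a" <= text <= "z":
        return ord(text) - ord("a") + 1
    return None


def interpret_n(value, priority="roman"):
    """Interpret an @n value as a number where possible; @n is not case sensitive."""
    v = value.strip().lower()
    if v.isascii() and v.isdigit():
        return int(v)
    readers = [roman_to_int, letter_to_int]
    if priority == "letters":
        readers.reverse()
    for read in readers:
        result = read(v)
        if result is not None:
            return result
    return v  # non-numeration, e.g. "matt"


print(interpret_n("c"), interpret_n("c", priority="letters"))
```

The same value, "c", resolves to 100 or to 3 depending on which numeration system is tried first, which is the choice the priority setting records.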

`@n` advanced

Referring to Biblical books and generic div types (e.g., title, epilogue) has been difficult because everyone uses different abbreviations.
TAN @n synonymity reconciles divergent practices.
Must be invoked, e.g., <vocabulary which="bible eng"/>
<n-alias div-type="">, optional, is used to enhance performance for synonyms of @n
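
The idea of `@n` synonymity can be pictured as a lookup from many abbreviations to one shared name. This is only a toy model: the `ALIASES` table and `canonical_n` are invented for illustration and do not reproduce any TAN vocabulary file.

```python
# Hypothetical alias table; real TAN vocabularies are far larger.
ALIASES = {
    "matthew": {"matt", "mt", "matthew"},
    "genesis": {"gen", "gn", "genesis"},
}


def canonical_n(value):
    """Map an @n value to a shared name if it is a known synonym."""
    key = value.strip().lower()  # @n is not case sensitive
    for canonical, names in ALIASES.items():
        if key in names:
            return canonical
    return key  # unknown values pass through unchanged


print(canonical_n("Matt") == canonical_n("mt"))
```

Two transcriptions that label the same book "Matt" and "mt" can then be aligned, because both values resolve to the same name.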

Exercise 4

Reference systems

Recap

What we've learned

Adjust TEI files to be TAN-compliant.
Coordinate transcriptions with vocabulary.
Understand how TAN handles space, Unicode, word division, and other types of normalization.
Learn how to handle reference systems.

General discussion

Finish, evaluate exercises
