TAN Tutorial 1

Preparing a TEI Corpus for the Text Alignment Network

Session 3

Alignment and Annotation

Objectives

Develop a TAN-A file for aligning and annotating the corpus of files
Expose files to a network
Use files on the network
Understand how claims (annotations) work

Alignment

Alignment is challenging
Traditional methods (@xml:id, XPath, URI) are arbitrary and hard to edit or administer
Different things need alignment: work, version, text section, phrase, word

TAN alignment mechanisms

Unit	Mechanism
Work/version	IRI
Scriptum	IRI
Text divisions	chained `@n`s = `@ref`, optionally with `<adjustments>`
Phrases	class 1 `<div>` or TAN-A
Words, characters	TAN-A-tok
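
A minimal sketch of how chained `@n` values can yield a reference for each text division. The sample XML is a simplified, hypothetical stand-in for a class-1 body, not a valid TAN file, and joining the `@n` values with a space is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a class-1 body; not a valid TAN file.
SAMPLE = """
<body>
  <div n="1">
    <div n="1">In the beginning</div>
    <div n="2">And the earth</div>
  </div>
  <div n="2">
    <div n="1">Thus the heavens</div>
  </div>
</body>
""".strip()


def leaf_refs(element, prefix=()):
    """Yield (ref, text) for every leaf <div>; ref chains the @n values of its ancestors."""
    for div in element.findall("div"):
        chain = prefix + (div.get("n"),)
        if div.find("div") is None:
            # Leaf division: the chained @n values act as its reference.
            yield " ".join(chain), (div.text or "").strip()
        else:
            # Non-leaf division: descend, carrying the chain so far.
            yield from leaf_refs(div, chain)


refs = dict(leaf_refs(ET.fromstring(SAMPLE)))
for ref, text in refs.items():
    print(ref, "->", text)
```

Two versions of a work that share the same division scheme would produce the same references, which is what lets the sections be aligned without `@xml:id` or XPath.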

Annotations

Annotations

A text about one or more texts
Has one or more anchors, of variable specificity

TEI annotation

Excellent for limited, clear, isolatable annotations
More complicated cases are challenging
- Clusters of annotations
- Credited annotations
- Overlapping annotations
- Complex annotations
Stand-off annotation helps, but its pointing mechanisms face the same challenges as alignment mechanisms
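
A minimal sketch of why credited, overlapping annotations are easier to hold outside the text. The tokens, authors, and notes are invented for illustration; the data structure is a generic stand-off model, not TEI or TAN syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    author: str  # who is credited with the annotation
    first: int   # index of the first token covered (inclusive)
    last: int    # index of the last token covered (inclusive)
    note: str


# Hypothetical sample text, split into tokens.
TOKENS = "sing goddess the wrath of Achilles".split()

# Two credited annotations whose spans overlap; as inline elements they
# could not both be well-formed without splitting one of them.
ANNOTATIONS = [
    Annotation("editor-a", 0, 3, "invocation"),
    Annotation("editor-b", 2, 5, "theme of the poem"),
]


def overlaps(a, b):
    """True if the two annotations share at least one token."""
    return a.first <= b.last and b.first <= a.last


def covered_text(a, tokens=TOKENS):
    """Return the stretch of text an annotation points to."""
    return " ".join(tokens[a.first : a.last + 1])


for ann in ANNOTATIONS:
    print(ann.author, "->", covered_text(ann), "|", ann.note)
```

The annotations point by token position, so they break if the text is re-tokenized or edited; that fragility is the pointing problem shared with alignment.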

The TAN approach to annotation

Inline TEI annotations are fine
For complex cases:
- Stand-off annotation in separate files
- Point via reference system, word (token), character
Annotation = claim → claimant
Semantic orientation, human-friendly syntax

Class 2 Files

Always point to class-1 sources
Names always start with "TAN-A"
The A stands for alignment, annotation, or both

TAN Class 2

Name	Description	Sources	Use
TAN-A	Generic alignment, annotation	zero or more	various
TAN-A-lm	Lexicomorphology	one	linguistics
TAN-A-tok	Bitext alignment	two	translation studies

TAN-A

Points to zero or more sources
Allows minor, non-intrusive adjustments to sources
Sources can be grouped via <alias>
<body> can be empty; otherwise it contains <claim>s

Caveats

Validation can be time-consuming with long sources or many claims
Export to RDF is possible only in theory
Editorial tools are in their infancy
Requires vocabulary awareness

Demonstration

Exercise 1

Create a TAN-A file

Network

TAN is a network
No file is an island
Files "talk" to each other
Decentralized
Cross-project
The key to publishing and communication: <master-location>
Catalogues help in discovery

TAN linking elements

Name	Core?	Class	Description
`<vocabulary>`	yes	all	Points to vocabulary essential for interpreting the file
`<inclusion>`	yes	all	Points to files whose components may be incorporated wholesale into the current file
`<predecessor>`	no	all	Points to a previous version of the file
`<successor>`	no	all	Points to a new version of the file; implies that the current one is obsolete
`<see-also>`	no	all	Points to any file
`<redivision>`	no	1	Points to another class-1 file with an identical transcription, but in a different div structure
`<model>`	no	1	Points to another class-1 file whose div structure exemplifies the one being used currently
`<source>`	yes	2	Points to class-1 files
`<morphology>`	yes	2	Used by TAN-A-lm to point to a TAN-mor file
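
A minimal sketch of how linking elements turn files into a navigable network. The file names and the `LINKS` catalogue are hypothetical; the only behavior taken from the table above is that a successor link implies the current file is obsolete.

```python
# Hypothetical catalogue: file id -> {linking element name: [target file ids]}.
LINKS = {
    "transcription-v1": {"successor": ["transcription-v2"]},
    "transcription-v2": {
        "predecessor": ["transcription-v1"],
        "successor": ["transcription-v3"],
    },
    "transcription-v3": {"predecessor": ["transcription-v2"]},
    "alignment": {"source": ["transcription-v1"]},
}


def latest_version(file_id, links=LINKS):
    """Follow successor links until reaching a file that has none."""
    seen = set()
    while file_id not in seen:
        seen.add(file_id)
        successors = links.get(file_id, {}).get("successor", [])
        if not successors:
            return file_id
        file_id = successors[0]
    raise ValueError("successor links form a cycle")


def outdated_sources(file_id, links=LINKS):
    """Map each obsolete source of a class-2 file to its latest version."""
    result = {}
    for source in links.get(file_id, {}).get("source", []):
        newest = latest_version(source, links)
        if newest != source:
            result[source] = newest
    return result


print(outdated_sources("alignment"))
```

A class-2 file that depends on an obsolete source can be flagged this way without any central registry, which is the point of a decentralized network.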

Demonstration

Session 2 transcription

Transcription prepared for publication

Exercise 2

Prepare the network

TAN-A Claims

Always attributed to a claimant
Recursion is allowed (X claims that Y claims that Z...)
RDF model: reification, n-ary triples
Still experimental
<verb> is the heart of a claim
Verb components can be customized to require or disallow parts
Claims can be negated or nuanced via @adverb
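
A minimal sketch of the shape of a claim as described above: every claim has a claimant, a verb sits at its center, an adverb can negate or nuance it, and a claim can contain another claim. The class names, field names, and sample vocabulary ("quotes", "not") are assumptions for illustration, not TAN-A syntax.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Statement:
    subject: str
    verb: str                     # the heart of the claim
    obj: str
    adverb: Optional[str] = None  # negates or nuances the verb


@dataclass(frozen=True)
class Claim:
    claimant: str                         # every claim is attributed
    content: Union[Statement, "Claim"]    # a claim may wrap another claim


def render(node):
    """Spell out a claim, recursing through nested claims."""
    if isinstance(node, Claim):
        return f"{node.claimant} claims that {render(node.content)}"
    verb = f"{node.adverb} {node.verb}" if node.adverb else node.verb
    return f"{node.subject} {verb} {node.obj}"


def claimants(claim):
    """List the chain of claimants, outermost first."""
    chain = []
    node = claim
    while isinstance(node, Claim):
        chain.append(node.claimant)
        node = node.content
    return chain


# Hypothetical example: X claims that Y claims that Z.
inner = Claim("scholar-y", Statement("work-a", "quotes", "work-b", adverb="not"))
outer = Claim("editor-x", inner)
print(render(outer))
```

Keeping the claimant on every level means a reader can always tell who is responsible for which part of a nested assertion.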

Demonstration

TAN-A without claims

Transcription prepared for publication

Exercise 3

Caveats

<claim>s are an experimental technology
<verb>s are an experimental technology
Validation, utilities, applications may not perform as expected with experimental technology
No attempt has yet been made to export to RDF
Parts of works cannot yet be generally expressed via URI, except by private convention

Recap

What we've learned

Develop a TAN-A file for aligning and annotating the corpus of files
Expose files to a network
Use files on the network
Understand how claims (annotations) work

General discussion

Finish and evaluate exercises