TAN Tutorial 1

Preparing a TEI Corpus for the Text Alignment Network

Session 3

Alignment and Annotation

Objectives

  • Develop a TAN-A file for aligning and annotating the corpus of files
  • Expose files to a network
  • Use files on the network
  • Understand how claims (annotations) work

Alignment

  • Alignment is challenging
  • Traditional methods (@xml:id, XPath, URI) are arbitrary and hard to edit and administer
  • Different things need alignment: work, version, text section, phrase, word

TAN alignment mechanisms

Unit | Mechanism
Work/version | IRI
Scriptum | IRI
Text divisions | chained @n values = @ref, optionally with <adjustments>
Phrases | class 1 <div> or TAN-A
Words, characters | TAN-A-tok
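
The text-division mechanism can be illustrated with a minimal Python sketch. It assumes a hypothetical fragment of nested <div> elements, each carrying an @n value, and builds a reference for every leaf division by chaining the @n values of its ancestors. The sample text, the function name leaf_refs, and the space separator are assumptions made for illustration, not a statement of how TAN itself serializes references.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: nested <div> elements, each with an @n value.
SAMPLE = """
<body>
  <div n="1">
    <div n="1">first part of section one</div>
    <div n="2">second part of section one</div>
  </div>
  <div n="2">
    <div n="1">first part of section two</div>
  </div>
</body>
"""

def leaf_refs(element, prefix=(), sep=" "):
    """Yield (ref, text) for each leaf <div>; ref chains the @n values."""
    for child in element.findall("div"):
        chain = prefix + (child.get("n"),)
        if child.find("div") is None:
            # Leaf division: emit the chained reference and its text.
            yield sep.join(chain), (child.text or "").strip()
        else:
            # Non-leaf division: descend, carrying the chain so far.
            yield from leaf_refs(child, chain, sep)

refs = dict(leaf_refs(ET.fromstring(SAMPLE)))
for ref, text in refs.items():
    print(ref, "->", text)
```

Two versions of a work that share the same chained references can then be aligned division by division without any pointers into the markup itself.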

Annotations


  • A text about one or more texts
  • Has one or more anchors, of variable specificity

TEI annotation

  • Excellent for limited, clear, isolatable annotations
  • More complicated cases are challenging
    • Clusters of annotations
    • Credited annotations
    • Overlapping annotations
    • Complex annotations
  • Stand-off annotation helps, but its pointing mechanisms face the same challenges as alignment mechanisms

The TAN approach to annotation

  • Inline TEI annotations are fine
  • For complex cases:
    • Stand-off annotation in separate files
    • Point via reference system, word (token), or character
  • Annotation = claim → claimant
  • Semantic orientation, human-friendly syntax
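
As a rough illustration of pointing by reference, token, and character, the sketch below resolves a stand-off anchor against a tiny source. Everything in it is an assumption made for the example: the SOURCE dictionary, the Anchor class, the 1-based counting, and the simple word-character tokenizer are hypothetical and do not reproduce TAN's own tokenization rules.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical source: reference -> text of that division.
SOURCE = {
    "1 1": "The quick brown fox",
    "1 2": "jumps over the lazy dog",
}

@dataclass(frozen=True)
class Anchor:
    ref: str                     # which text division
    token: Optional[int] = None  # 1-based word position in the division
    char: Optional[int] = None   # 1-based character position in the word

def resolve(anchor, source):
    """Return the division, word, or character that the anchor points to."""
    text = source[anchor.ref]
    if anchor.token is None:
        return text
    tokens = re.findall(r"\w+", text)  # assumed tokenizer: runs of word characters
    word = tokens[anchor.token - 1]
    if anchor.char is None:
        return word
    return word[anchor.char - 1]

print(resolve(Anchor("1 2"), SOURCE))
print(resolve(Anchor("1 1", token=3), SOURCE))
print(resolve(Anchor("1 1", token=3, char=1), SOURCE))
```

Because the anchor names a reference and a position rather than a location in the markup, the annotation can live in a separate file and survive changes to the source's encoding.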

Class 2 Files

  • Always point to class-1 sources
  • Names always start with "TAN-A"
  • The A stands for alignment, annotation, or both

TAN Class 2

Name | Description | Sources | Use
TAN-A | Generic alignment, annotation | zero or more | various
TAN-A-lm | Lexicomorphology | one | linguistics
TAN-A-tok | Bitext alignment | two | translation studies
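
The source counts in the table above can be written down as a small check. This is only a sketch of the rule as the table states it; the names SOURCE_LIMITS and source_count_ok are hypothetical and are not part of any TAN tool.

```python
# Source-count rules copied from the table; None means "no upper limit".
SOURCE_LIMITS = {
    "TAN-A": (0, None),    # zero or more sources
    "TAN-A-lm": (1, 1),    # exactly one source
    "TAN-A-tok": (2, 2),   # exactly two sources
}

def source_count_ok(file_type, n_sources):
    """Return True if n_sources is allowed for the given class-2 file type."""
    low, high = SOURCE_LIMITS[file_type]
    return n_sources >= low and (high is None or n_sources <= high)

print(source_count_ok("TAN-A", 0))
print(source_count_ok("TAN-A-tok", 3))
```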

TAN-A

  • Points to zero or more sources
  • Minor non-intrusive adjustments
  • Sources can be grouped via <alias>
  • <body> can be empty; otherwise it contains <claim>s

Caveats

  • Validation can be time-consuming with long sources or many claims
  • Export to RDF is possible only in theory
  • Editorial tools are in their infancy
  • Requires vocabulary awareness

Demonstration

Skeleton TAN-A
Prepared TAN-A

Exercise 1

Create a TAN-A file

Network

  • TAN is a network
  • No file is an island
  • Files "talk" to each other
  • Decentralized
  • Cross-project
  • The key to publishing, communication: <master-location>
  • Catalogues help in discovery

TAN linking elements

Name | Core? | Class | Description
<vocabulary> | yes | all | Points to vocabulary essential for interpreting the file
<inclusion> | yes | all | Points to files whose components may be incorporated wholesale into the current file
<predecessor> | no | all | Points to a previous version of the file
<successor> | no | all | Points to a new version of the file; implies that the current one is obsolete
<see-also> | no | all | Points to any file
<redivision> | no | 1 | Points to another class-1 file with an identical transcription, but in a different div structure
<model> | no | 1 | Points to another class-1 file whose div structure exemplifies the one being used currently
<source> | yes | 2 | Points to class-1 files
<morphology> | yes | 2 | Used by TAN-A-lm to point to a TAN-mor file
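
To make the idea of files "talking" to each other concrete, here is a minimal Python sketch that lists the outgoing links in a file's head. The element names come from the table above, but the sample head is a simplified, hypothetical fragment: the nested <location href="..."/> child, the missing namespace, and the example.org addresses are assumptions for illustration, and a real TAN file should be checked against the TAN schemas.

```python
import xml.etree.ElementTree as ET

# Element names taken from the table of TAN linking elements.
LINKING_ELEMENTS = {
    "vocabulary", "inclusion", "predecessor", "successor", "see-also",
    "redivision", "model", "source", "morphology",
}

# Hypothetical, simplified head; not a schema-valid TAN file.
HEAD = """
<head>
  <source><location href="http://example.org/text-a.xml"/></source>
  <source><location href="http://example.org/text-b.xml"/></source>
  <see-also><location href="http://example.org/notes.xml"/></see-also>
  <name>Not a linking element</name>
</head>
"""

def outgoing_links(head_xml):
    """Return (element name, href) pairs for every linking element in the head."""
    links = []
    for child in ET.fromstring(head_xml):
        if child.tag in LINKING_ELEMENTS:
            for loc in child.iter("location"):
                links.append((child.tag, loc.get("href")))
    return links

links = outgoing_links(HEAD)
for kind, href in links:
    print(kind, "->", href)
```

Following such pairs from file to file is one way to picture the decentralized network: each file declares what it depends on, and no central registry is needed.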

Demonstration

Session 2 transcription
Transcription prepared for publication

Exercise 2

Prepare the network

TAN-A Claims

  • Always attributed to a claimant
  • Recursion ok (X claims that Y claims that Z...)
  • RDF model: reification, n-ary triples
  • Still experimental
  • <verb> is the heart of a claim
  • Verb components can be customized to require or disallow parts
  • Claims can be negated or nuanced via @adverb
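
The shape of a claim can be modeled in a few lines of Python. This is a conceptual sketch only: the classes Statement and Claim, their field names, and the sample values are hypothetical and do not reproduce TAN-A markup. It shows three points from the list above: every claim has a claimant, a claim may wrap another claim, and an adverb nuances the verb.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class Statement:
    subject: str
    verb: str                     # the heart of the claim
    object: str
    adverb: Optional[str] = None  # nuances or negates the verb

    def render(self):
        parts = [self.subject, self.adverb, self.verb, self.object]
        return " ".join(p for p in parts if p)

@dataclass(frozen=True)
class Claim:
    claimant: str                        # every claim is attributed
    content: Union[Statement, "Claim"]   # a claim may wrap another claim

    def render(self):
        return f"{self.claimant} claims that {self.content.render()}"

inner = Claim("scholar X", Statement("text B", "quotes", "text A", adverb="probably"))
outer = Claim("the editor", inner)
print(outer.render())
```

Treating the whole statement as the thing being attributed is what the RDF world calls reification: the claim itself becomes an object that other claims can talk about.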

Demonstration

TAN-A without claims
Transcription prepared for publication

Exercise 3

Write claims

Caveats

  • <claim>s are an experimental technology
  • <verb>s are an experimental technology
  • Validation, utilities, applications may not perform as expected with experimental technology
  • No attempt has yet been made to export to RDF
  • Parts of works cannot yet be generally expressed via URI, except by private convention

Recap

What we've learned

  • Develop a TAN-A file for aligning and annotating the corpus of files
  • Expose files to a network
  • Use files on the network
  • Understand how claims (annotations) work

General discussion

Finish and evaluate exercises