The Text Alignment Network

An Introduction

Version: 2021-08-30

What is TAN?

  • Suite of XML formats
  • Deals with TEI-like problems
  • Stand-off annotation
  • Pointers modeled on citations
  • Deep validation
  • Designed for multiple versions of texts
  • Enhanced cross-project interoperability
  • Rooted in semantics (RDF-friendly)
  • Schematron for natural language grammar
  • XSLT function library
  • Interoperable utilities/applications
  • Oxygen framework
  • Next-gen TEI / NLTK / CLTK
  • Open source
  • September 2021: Alpha release 2021
  • So much accomplished, so much to do
  • Co-developers welcome
  • textalign.net

Why bother?

The scenario: a new translation project

  • five scholars, five states
  • text written in the 4th c. CE, in Greek
  • translate the original
  • translate the 5th/6th-c. translations
  • comment

Our challenges

  • fragmentary text
  • 30+ books / editions of interest
  • multiple ancient translations
  • defective, disordered versions
  • extensive text reuse
  • project's work in flux

We needed a reading space

  • every version juxtaposed
  • include our working translations
  • include quoted literature

A vision: Parabola

  • Juxtapose multiple versions, multiple works, any language
  • Make quoted works available
  • Version grouping, sorting, filtering
  • Push-button operation: underlying master files → HTML output
  • Plug-n-play: add new versions as needed

Problems, Opportunities

  • How to generalize?
  • How to distribute?
  • How to coordinate works?
  • How to coordinate reference systems?
  • How to structure interoperable citations?

Standard TEI inadequate

  • Too laissez faire
  • Pointers, syntax are project-oriented
  • teiHeader is confusing, not semantic-oriented
  • etc.

A vision: TAN

  • Constrained, decluttered TEI
  • Annotations → simple, networked stand-off files
  • Pointers based on existing citation conventions
  • Cross-project communication
  • Streamlined metadata rooted in semantics, accessed by familiar names, abbreviations
  • Deeper validation
  • New inclusion techniques
  • So much more

Caveats

TAN Gaze

Our gaze can be directed toward a text-bearing object or away from it. Many TEI files direct our gaze toward scripta. But the TAN ecosystem directs our gaze toward the intertextual ocean.

  • Extensive normalization
  • Shared conventions

Not all problems can be fixed by a computer

  • Competing reference systems
  • Incongruous text structures
  • Presumptuous synonymity
  • Ineluctable polysemy

TAN is not traditional software

  • No GUI: to be used with an XML editor or on the command line
  • Not a web service, not an API (but can be integrated)
  • Rooted in XSLT 3.0

Work in progress

  • TAN will change
  • Some operations may be slow
  • You may encounter bugs
  • We're all learning together

Installing TAN

Installation

  • Download
  • Put anywhere
  • Open a copy of tan.xpr in Oxygen
  • Get to work

Digital assets

Organization

  • TAN's directory structure is relatively flat
  • Names of top-level folders (plural nouns) describe purpose of the contained files
  • Files you might want to use or change are close to the root
  • Development branch supports maintenance validation, tests

TAN Guidelines

TAN Guidelines

  • Docbook (master)
  • HTML (website or Oxygen transformation)
  • PDF (website or Oxygen transformation)

TAN Schemas

/schemas

  • *.rnc/*.rng for structural validation
  • *.sch for detailed validation
  • Point to schemas via processing instructions, or let Oxygen infer and apply.
  • See the guidelines

Schematron validation phases

  • terse: essential errors
  • normal: standard errors
  • verbose: deep checking
  • Schematron is just a messenger.
  • The hard work is done by the TAN library (XSLT).

Errors

See the guidelines for a list of 120+ errors

TAN Vocabulary

Vocabulary features

  • Avoids repetition
  • Tethers nicknames/abbreviations to URIs
  • Standard vocabulary: /vocabularies
  • Standard vocabulary can be overridden
  • Wide range of vocabularies: division types, grammar, licenses, roles, tokenization, etc.
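
The idea of tethering nicknames to URIs, with local definitions overriding a standard set, can be sketched in a few lines of Python. This is only an illustration of the concept, not TAN's actual mechanism: the names `STANDARD_VOCABULARY` and `resolve` and the example.org IRIs are hypothetical.

```python
# Hypothetical illustration only: these are not TAN's real vocabulary items or IRIs.
STANDARD_VOCABULARY = {
    "grc": "http://example.org/vocab/language/ancient-greek",
    "ed": "http://example.org/vocab/role/editor",
}


def resolve(name, local_overrides=None):
    """Return the IRI tethered to a nickname; local definitions win over standard ones."""
    merged = {**STANDARD_VOCABULARY, **(local_overrides or {})}
    try:
        return merged[name]
    except KeyError:
        raise KeyError(f"no vocabulary item named {name!r}") from None


# A project reuses the standard nickname "grc" but redefines "ed" for itself.
project_overrides = {"ed": "http://example.org/my-project/role/editor"}
print(resolve("grc", project_overrides))
print(resolve("ed", project_overrides))
```

Because a file only ever mentions the short name, repetition is avoided, and the IRI behind the name can be swapped without touching the rest of the file.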

TAN Parameters

/parameters

  • Adjusts validation, utilities, applications
  • Any parameter can be changed
  • Not every parameter affects every operation

Parameters: caution

  • Need to understand data types and XSLT syntax
  • Very helpful to understand regular expressions
  • Some processes may not be affected by some parameters
  • It's up to you to keep track of your configuration

TAN functions

TAN function library

  • 250+ public functions
  • Drives Schematron validation
  • Foundation for utilities, applications
  • For XSLT developers, one line: <xsl:include href="[PATH]/functions/TAN-function-library.xsl"/>

TAN function library coverage

Arabic, archives, arrays, attributes, binary, booleans, checksums, codepoints, colors, datatypes, diff, docx, expansion, filenames, files, Greek, grouping, html, identifiers, items, language, Latin, lexicomorphology, maps, merging, namespaces, nodes, numerals, numerics, pointers, regular expressions, resolution, search, sequences, serialization, spacing, statistics, strings, Syriac, tree manipulation, uris, versioning, vocabulary

TAN utilities

/utilities

  • Get material into TEI / TAN
  • Perform challenging / time-consuming editing
  • Managed through Oxygen transformations
  • All documentation, configuration at /utilities/[NAME]/[NAME.xsl]
  • See guidelines to understand the XSLT process

Some TAN utilities...

Body Builder

  • Convert plain text or Word docx to TAN / TEI
  • Requires strong command of regular expressions
  • To be configured iteratively
  • Designed for complex workflows
  • Can save many hours of work
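
To give a feel for why regular expressions matter here, the following Python sketch turns numbered plain-text paragraphs into `div` elements. It is not Body Builder itself, which is an XSLT utility. The pattern `SECTION_START`, the function `build_body`, and the element and attribute names are assumptions chosen for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: each new division starts with a number, a period, and a space.
SECTION_START = re.compile(r"^(\d+)\.\s+(.*)$")


def build_body(plain_text):
    """Wrap numbered paragraphs in <div n="..."> elements inside a <body>."""
    body = ET.Element("body")
    current = None
    for line in plain_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = SECTION_START.match(line)
        if match:
            # A numbered line opens a new division.
            current = ET.SubElement(body, "div", {"n": match.group(1)})
            current.text = match.group(2)
        elif current is not None:
            # An unnumbered line continues the division that is open.
            current.text += " " + line
        # Text before the first numbered line is ignored in this sketch.
    return body


sample = "1. First line\ncontinues here\n\n2. Second"
print(ET.tostring(build_body(sample), encoding="unicode"))
```

A real conversion usually needs several such patterns, adjusted over a number of runs, which is why the utility is meant to be configured iteratively.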

Body Remodeler

  • Mold a text into the div structure of a model
  • Best done incrementally, in concert with special Oxygen Author tools
  • To be configured iteratively
  • Regular expressions determine where breaks are not allowed
  • Can save many hours of work

TAN applications

/applications

  • TAN for end use: teaching, publication, research
  • Complex, configurable code
  • Managed through Oxygen transformations
  • All documentation, configuration at /applications/[NAME]/[NAME.xsl]
  • See guidelines to understand the XSLT process

TAN Parabola

  • Juxtapose multiple versions of a work, along with annotations, into a single reading space
  • Oldest and most popular application
  • Sources can be grouped, sorted, filtered
  • Managed through a TAN-A file
  • Useful for publishing, teaching, research, collaboration

TAN Diff+

  • Compare 2 or more texts
  • Visualize letter-for-letter differences
  • Study statistical analyses
  • Normalize input
  • De-normalize output
  • Quantify, measure text differences
  • Useful in many areas of academia and industry
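
The normalize-then-compare idea can be sketched with Python's standard `difflib` module. This is a stand-in for illustration, not the algorithm Diff+ uses, and it covers only two texts and skips de-normalization. The functions `normalize` and `letter_diff` are hypothetical names, and lowercasing plus space collapsing is just one possible normalization.

```python
import difflib


def normalize(text):
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def letter_diff(a, b):
    """Return the letter-level differences between two texts and a similarity ratio."""
    a, b = normalize(a), normalize(b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    differences = [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return differences, matcher.ratio()


differences, ratio = letter_diff("Color", "colour")
print(differences)
print(round(ratio, 3))
```

Normalizing first keeps trivial differences (case, spacing) out of the results, so the statistics reflect only the differences that matter to the comparison.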

TAN Tangram

  • Detect verbatim parallels between two groups of texts
  • Designed for Greek and Latin literature
  • Disregards ngram order
  • Allows inserted words
  • Scores results across multiple metrics
  • Useful for identifying plagiarism, quotations, common themes, collocations
  • Still under development
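
One way to picture order-independent n-grams that tolerate inserted words is the Python sketch below. It is an assumption-laden illustration, not Tangram's implementation: the function names, the whitespace tokenizer, and the defaults `n=3` and `max_inserted=1` are all hypothetical, and no scoring is attempted.

```python
from itertools import combinations


def unordered_ngrams(tokens, n=3, max_inserted=1):
    """Collect every group of n tokens found within a window of n + max_inserted tokens.

    Each group is stored as a sorted tuple, so word order is disregarded.
    """
    window = n + max_inserted
    grams = set()
    for start in range(len(tokens)):
        chunk = tokens[start:start + window]
        for combo in combinations(chunk, n):
            grams.add(tuple(sorted(combo)))
    return grams


def shared_parallels(text_a, text_b, n=3, max_inserted=1):
    """Return the unordered n-grams that two texts have in common."""
    grams_a = unordered_ngrams(text_a.lower().split(), n, max_inserted)
    grams_b = unordered_ngrams(text_b.lower().split(), n, max_inserted)
    return grams_a & grams_b


# The second text reorders the words and inserts "ego".
print(shared_parallels("arma virumque cano", "cano ego arma virumque"))
```

Setting `max_inserted=0` in this sketch requires the words to be adjacent, and the example above then yields no parallels.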

Applications under development

  • Translation explicitation
  • Lexical complexity
  • Stylometry
  • The sky is the limit

TAN Formats

Format design principles

  • One format, one purpose
  • Predictable design
  • Human readable
  • Computer actionable

Format classes

  • Class 1: texts
  • Class 2: annotations on texts
  • Class 3: everything else

Class 1: Texts

  • TEI: TEI All, slightly adjusted for TAN
  • TAN-T: as close as you can get to plain text

Class 1 Features

  • One version, one work, one reference system, one scriptum
  • @xml:id normally not needed
  • Complex, aliased @ns supported
  • Normalization of space, Unicode, line breaks, references
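
Space and Unicode normalization of the kind listed above might look like the following Python sketch. The choice of NFC and the rule of collapsing all whitespace to single spaces are assumptions made for illustration. TAN's own rules are set out in its guidelines and may differ.

```python
import re
import unicodedata


def normalize_text(text, form="NFC"):
    """Apply a Unicode normalization form, then collapse whitespace (including line breaks)."""
    text = unicodedata.normalize(form, text)
    return re.sub(r"\s+", " ", text).strip()


# "e" followed by a combining acute accent becomes the single precomposed character.
raw = "caf" + "e\u0301" + " \n  au lait "
print(normalize_text(raw) == "caf\u00e9 au lait")
```

Two transcriptions that look identical on screen can differ at the codepoint level, so normalizing before comparison or alignment avoids false mismatches.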

Class 2: Annotations

  • TAN-A: general alignment, annotations
  • TAN-A-lm: lexico-morphology (part of speech)
  • TAN-A-tok: detailed word-for-word bitext alignment

Class 2 Features

  • Simple adjustments allowed
  • Pointing system based on @n and tokens
  • Tokenization defined by regular expressions
  • All annotations are claims
  • Claims designed to reflect scholarly concerns
  • Conversion to RDF (reified, complex) pending
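
A pointer built from an @n-style reference and a token position can be resolved roughly as in the Python sketch below. The dictionary of divisions, the function names, and the token pattern `\w+` are hypothetical and stand in for whatever a real class 2 file declares. The sketch is not TAN's pointer syntax.

```python
import re

# Assumption: a token is an unbroken run of word characters.
TOKEN_PATTERN = re.compile(r"\w+")


def tokenize(text, pattern=TOKEN_PATTERN):
    """Split a text into tokens according to a regular expression."""
    return pattern.findall(text)


def resolve_pointer(divisions, n, token_position):
    """Return the token at a 1-based position within the division labelled n."""
    tokens = tokenize(divisions[n])
    if not 1 <= token_position <= len(tokens):
        raise IndexError(f"division {n!r} has only {len(tokens)} tokens")
    return tokens[token_position - 1]


# Hypothetical source text keyed by reference labels.
divisions = {
    "1.1": "In the beginning was the word",
    "1.2": "and the word was with God",
}
print(resolve_pointer(divisions, "1.1", 3))
```

Because the pointer depends on the tokenization pattern, annotator and reader must agree on that pattern for the reference to land on the same word.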

Class 3: Everything else

  • TAN-voc: vocabulary (IRIs, names, descriptions of concepts)
  • TAN-mor: customizable rules for grammars used by TAN-A-lm files
  • catalog.tan: catalogs to index TAN files

TAN Metadata

Metadata design principles

  • Focus only on the data
  • Predictable structure
  • Eliminate repetition
  • Human readable
  • Computer actionable
  • Chain of responsibility
  • Cross-project communication

Scholarly responsibility

  • Who did what when?
  • What are your sources?
  • How do you define your terms?
  • What alterations have you made to your sources?
  • What rights do I have to use your material?

Metadata structure

  1. Declarations: name, license, numeral system, tokenization, adjustments
  2. Network
  3. Vocabulary
  4. Responsibilities, changes, work pending

Next steps

Next steps

  • Examples (/examples)
  • Other libraries
  • Guidelines
  • Tutorials

TAN

textalign.net