The Text Alignment Network

An Introduction

Version: 2021-08-30

What is TAN?

Suite of XML formats
Deals with TEI-like problems
Stand-off annotation
Pointers modeled on citations
Deep validation

Designed for multiple versions of texts
Enhanced cross-project interoperability
Rooted in semantics (RDF-friendly)
Schematron for natural language grammar

XSLT function library
Interoperable utilities/applications
Oxygen framework
Next-gen TEI / NLTK / CLTK

Open source
September 2021: Alpha release 2021
So much accomplished, so much to do
Co-developers welcome
textalign.net

Why bother?

The scenario: a new translation project

five scholars, five states
text written in the 4th c. CE, in Greek
translate the original
translate the 5th/6th-c. translations
comment

Our challenges

fragmentary text
30+ books / editions of interest
multiple ancient translations
defective, disordered versions
extensive text reuse
project's work in flux

We needed a reading space

every version juxtaposed
include our working translations
include quoted literature

A vision: Parabola

Juxtapose multiple versions, multiple works, any language
Make quoted works available
Version grouping, sorting, filtering
Push-button operation: underlying master files → HTML output
Plug-n-play: add new versions as needed

Problems, Opportunities

How to generalize?
How to distribute?
How to coordinate works?
How to coordinate reference systems?
How to structure interoperable citations?

Standard TEI inadequate

Too laissez faire
Pointers, syntax are project-oriented
teiHeader is confusing, not semantic-oriented
etc.

A vision: TAN

Constrained, decluttered TEI
Annotations → simple, networked stand-off files
Pointers based on existing citation conventions
Cross-project communication
Streamlined metadata rooted in semantics, accessed by familiar names, abbreviations
Deeper validation
New inclusion techniques
So much more

Caveats

TAN Gaze

Our gaze can be directed toward a text-bearing object or away from it. Many TEI files direct our gaze toward scripta. But the TAN ecosystem directs our gaze toward the intertextual ocean.

Extensive normalization
Shared conventions

Not all problems can be fixed by a computer

Competing reference systems
Incongruous text structures
Presumptuous synonymity
Ineluctable polysemy

TAN is not traditional software

No GUI: to be used with an XML editor or on the command line
Not a web service, not an API (but can be integrated)
Rooted in XSLT 3.0

Work in progress

TAN will change
Some operations may be slow
You may encounter bugs
We're all learning together

Installing TAN

Installation

Download
Put anywhere
Open a copy of tan.xpr in Oxygen
Get to work

Digital assets

Organization

TAN's directory structure is relatively flat
Names of top-level folders (plural nouns) describe purpose of the contained files
Files you might want to use or change are close to the root
Development branch supports maintenance, validation, tests

TAN Guidelines

Docbook (master)
HTML (website or Oxygen transformation)
PDF (website or Oxygen transformation)

TAN Schemas

/schemas

*.rnc/*.rng for structural validation
*.sch for detailed validation
Point to schemas via processing instructions, or let Oxygen infer and apply.
See the guidelines

Schematron validation phases

terse: essential errors
normal: standard errors
verbose: deep checking
Schematron is just a messenger.
The hard work is done by the TAN library (XSLT).

Errors

See the guidelines for a list of 120+ errors

TAN Vocabulary

Vocabulary features

Avoids repetition
Tethers nicknames/abbreviations to URIs
Standard vocabulary: /vocabularies
Standard vocabulary can be overridden
Wide range of vocabularies: division types, grammar, licenses, roles, tokenization, etc.
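
The tethering idea can be sketched in a few lines of Python. This is only an illustration, not TAN's own code (which is XSLT): `STANDARD_VOCABULARY`, `resolve_nickname`, and the `example.org` IRIs are hypothetical placeholders.

```python
# Hypothetical standard vocabulary: nickname -> IRI (placeholder IRIs).
STANDARD_VOCABULARY = {
    "ch": "tag:example.org,2021:div-type:chapter",
    "v": "tag:example.org,2021:div-type:verse",
}


def resolve_nickname(nickname, local_vocabulary=None):
    """Return the IRI for a nickname; local entries override standard ones."""
    merged = {**STANDARD_VOCABULARY, **(local_vocabulary or {})}
    try:
        return merged[nickname]
    except KeyError:
        raise KeyError(f"no vocabulary item is named {nickname!r}") from None


# A project redefines "v" but still inherits "ch" from the standard set.
local = {"v": "tag:example.org,2021:div-type:volume"}
print(resolve_nickname("ch", local))
print(resolve_nickname("v", local))
```

The nickname is written once in the file's metadata and reused everywhere; only the lookup table knows the IRI.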

TAN Parameters

/parameters

Adjusts validation, utilities, applications
Any parameter can be changed
Not every parameter affects every operation

Parameters: caution

Need to understand data types and XSLT syntax
Very helpful to understand regular expressions
Some processes may not be affected by some parameters
It's up to you to keep track of your configuration

TAN functions

TAN function library

250+ public functions

Drives schematron validation

Foundation for utilities, applications

For XSLT developers, one line: <xsl:include href="[PATH]/functions/TAN-function-library.xsl"/>

TAN function library coverage

Arabic, archives, arrays, attributes, binary, booleans, checksums, codepoints, colors, datatypes, diff, docx, expansion, filenames, files, Greek, grouping, html, identifiers, items, language, Latin, lexicomorphology, maps, merging, namespaces, nodes, numerals, numerics, pointers, regular expressions, resolution, search, sequences, serialization, spacing, statistics, strings, Syriac, tree manipulation, uris, versioning, vocabulary

TAN utilities

/utilities

Get material into TEI / TAN
Perform challenging / time-consuming editing
Managed through Oxygen transformations
All documentation, configuration at /utilities/[NAME]/[NAME].xsl
See guidelines to understand the XSLT process

Some TAN utilities...

Body Builder

Convert plain text or Word docx to TAN / TEI
Requires strong command of regular expressions
To be configured iteratively
Designed for complex workflows
Can save many hours of work

Body Remodeler

Mold a text into the div structure of a model
Best done incrementally, in concert with special Oxygen Author tools
To be configured iteratively
Regular expressions determine where breaks are not allowed
Can save many hours of work

TAN applications

/applications

TAN for end use: teaching, publication, research
Complex, configurable code
Managed through Oxygen transformations
All documentation, configuration at /applications/[NAME]/[NAME].xsl
See guidelines to understand the XSLT process

TAN Parabola

Juxtapose multiple versions of a work, along with annotations, into a single reading space
Oldest and most popular application
Sources can be grouped, sorted, filtered
Managed through a TAN-A file
Useful for publishing, teaching, research, collaboration

TAN Diff+

Compare 2 or more texts
Visualize letter-for-letter differences
Study statistical analyses
Normalize input
De-normalize output
Quantify, measure text differences
Useful in many areas of academia and industry
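
For readers new to the idea, here is a minimal Python sketch of normalizing input, finding letter-for-letter differences, and quantifying them. It relies on the standard library's `difflib`, not on TAN's diff algorithm; the function names and the particular normalization (NFC, lowercase, collapsed space) are assumptions chosen for illustration.

```python
import difflib
import unicodedata


def normalize(text):
    """Illustrative normalization: Unicode NFC, lowercase, collapsed space."""
    return " ".join(unicodedata.normalize("NFC", text).lower().split())


def char_diff(a, b):
    """List (operation, text_in_a, text_in_b) for every span that differs."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def similarity(a, b):
    """Ratio between 0 and 1; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


left = normalize("  The  Color of the sky ")
right = normalize("the colour of the sky")
print(char_diff(left, right))
print(similarity(left, right))
```

Normalizing first keeps trivial differences (case, stray spaces) from drowning out the ones that matter.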

TAN Tangram

Detect verbatim parallels between two groups of texts
Designed for Greek and Latin literature
Disregards ngram order
Allows inserted words
Scores results across multiple metrics
Useful for identifying plagiarism, quotations, common themes, collocations
Still under development
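
One way to picture order-free matching that tolerates inserted words is the Python sketch below. It is an assumption-laden illustration, not Tangram's algorithm: the function names, the default n-gram size, and the `max_inserted` parameter are hypothetical, and scoring is left out.

```python
from itertools import combinations


def unordered_ngrams(tokens, n=3, max_inserted=1):
    """Map each order-free n-gram to the set of positions where it starts.

    An n-gram may be drawn from a span of up to n + max_inserted tokens,
    so up to max_inserted intervening words are tolerated.
    """
    found = {}
    window = n + max_inserted
    for start in range(len(tokens)):
        rest = range(start + 1, min(start + window, len(tokens)))
        for others in combinations(rest, n - 1):
            key = frozenset(tokens[i] for i in (start, *others))
            if len(key) == n:  # ignore n-grams that repeat a word
                found.setdefault(key, set()).add(start)
    return found


def shared_ngrams(tokens_a, tokens_b, n=3, max_inserted=1):
    """Return n-grams present in both texts, with their start positions."""
    a = unordered_ngrams(tokens_a, n, max_inserted)
    b = unordered_ngrams(tokens_b, n, max_inserted)
    return {key: (a[key], b[key]) for key in a.keys() & b.keys()}


quoted = "arma virumque cano".split()
reuse = "cano semper arma virumque".split()
print(shared_ngrams(quoted, reuse))
```

Because each n-gram is stored as a set of words, "arma virumque cano" and "cano ... arma virumque" produce the same key even though the word order differs and a word has been inserted.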

Applications under development

Translation explicitation
Lexical complexity
Stylometry

The sky is the limit

TAN Formats

Format design principles

One format, one purpose
Predictable design
Human readable
Computer actionable

Format classes

Class 1: texts
Class 2: annotations on texts
Class 3: everything else

Class 1: Texts

TEI: TEI All, slightly adjusted for TAN
TAN-T: as close as you can get to plain text

Class 1 Features

One version, one work, one reference system, one scriptum
@xml:id normally not needed
Complex, aliased @n values supported
Normalization of space, Unicode, line breaks, references

Class 2: Annotations

TAN-A: general alignment, annotations
TAN-A-lm: lexico-morphology (part of speech)
TAN-A-tok: detailed word-for-word bitext alignment

Class 2 Features

Simple adjustments allowed
Pointing system based on @n and tokens
Tokenization defined by regular expressions
All annotations are claims
Claims designed to reflect scholarly concerns
Conversion to RDF (reified, complex) pending
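
The pointing idea (find a division by its n values, then a token within it) can be sketched in Python. This is a simplified illustration under stated assumptions, not TAN syntax or code: `TOKEN_PATTERN`, `TEXT`, `resolve`, and the dotted reference form are all hypothetical.

```python
import re

# Hypothetical tokenization rule: a token is a run of word characters.
TOKEN_PATTERN = re.compile(r"\w+")

# Hypothetical text: each division is keyed by its chain of n values.
TEXT = {
    ("1", "1"): "In the beginning was the word",
    ("1", "2"): "and the word was with God",
}


def tokenize(text, pattern=TOKEN_PATTERN):
    """Split a string into tokens according to a regular expression."""
    return pattern.findall(text)


def resolve(ref, token_value, occurrence=1, text=TEXT):
    """Return the 1-based position of the nth occurrence of a token in a division."""
    tokens = tokenize(text[tuple(ref.split("."))])
    hits = [pos for pos, tok in enumerate(tokens, start=1) if tok == token_value]
    if not 1 <= occurrence <= len(hits):
        raise LookupError(f"{token_value!r} occurs {len(hits)} time(s) in {ref}")
    return hits[occurrence - 1]


print(resolve("1.1", "the", occurrence=2))
print(resolve("1.2", "word"))
```

A pointer of this kind reads like a scholarly citation ("second 'the' in 1.1") and stays valid across any version that shares the reference system and tokenization rule.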

Class 3: Everything else

TAN-voc: vocabulary (IRIs, names, descriptions of concepts)
TAN-mor: customizable rules for grammars used by TAN-A-lm files
catalog.tan: catalogs to index TAN files

TAN Metadata

Metadata design principles

Focus only on the data
Predictable structure
Eliminate repetition
Human readable
Computer actionable
Chain of responsibility
Cross-project communication

Scholarly responsibility

Who did what when?
What are your sources?
How do you define your terms?
What alterations have you made to your sources?
What rights do I have to use your material?

Metadata structure

Declarations: name, license, numeral system, tokenization, adjustments
Network
Vocabulary
Responsibilities, changes, work pending

Next steps

Examples (/examples)
Other libraries
Guidelines
Tutorials

TAN

textalign.net