Mining, Measuring, and Managing Textual Variation with TAN Diff+

Joel Kalvesmaki
4 April 2022, University of Bergen

Follow along

textalign.net/bergen

Background

(handout)

Text Difference

  • Well-known problem
  • Standard algorithm
  • Longest common subsequence
  • Numerous software packages

High-demand text differencing

  • N-way; normalization; weighted scoring; statistical analysis; transpositions; text as structure; big data; pipelines
  • Requires building a solution via a function library
    • Python: difflib's ndiff
    • 8 languages: diff-match-patch
    • XSLT: TAN Diff+
    • Write your own

A Quick TAN

TAN

  • Text Alignment Network
  • http://textalign.net
  • To enhance the semantic and syntactic interoperability of texts and textual annotations
  • Unique, extensive XSLT function library
  • GNU General Public License

TAN Function Library

  • > 250 functions
  • Many areas: strings, numerals, regular expressions, maps, arrays, checksums, etc.
  • Extensible
  • Customisable

tan:diff()

  • Algorithm: staggered samples
  • Runs in logarithmic time
  • Optimization not needed
  • Very high quality output
  • Default: snap to word
  • Under the hood: character-by-character (a minimal call is sketched below)
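A minimal sketch of calling tan:diff(), reusing the sample sentences from the Adjustments slide below. It assumes a host stylesheet that already imports the TAN function library (see Option 3 under Running TAN Diff+) and that the two-argument form simply takes the two strings to compare:

   <!-- Assumes the TAN function library is imported and bound to the tan: prefix. -->
   <xsl:variable name="comparison"
      select="tan:diff('Philosophy gives no pictures of reality.',
                       'philosophy gives nö pictures of reality')"/>
   <!-- $comparison is an XML tree marking shared and divergent stretches of the two
        strings; downstream functions such as tan:diff-to-delta() consume this tree. -->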

TAN Difference Functions

  • tan:diff()
  • tan:collate()
  • tan:collate-pair-of-sequences()
  • tan:diff-to-delta()
  • tan:levenshtein-distance()
  • tan:replace-diff()
  • tan:replace-collation()
  • tan:get-diff-output-transpositions()

TAN Diff+

Features (1/3)

  • Wraps tan:diff() and friends for common needs
  • XSLT application
  • Heuristic tool (not an editor) for specialists in industry, academia
  • N-way comparisons
  • Plain text (XML) user interface with lots of annotation
  • Extensible, customisable (liberal license)

Features (2/3)

  • Input: docx, XML, plain text (or folders)
  • Preprocessing
    • Normalization (case, space, punctuation)
    • Custom text alteration
    • Custom structure alteration, filtration
    • Serialization
    • Special normalization for Greek, Latin, Syriac
  • Postprocessing
    • De-normalization
    • Statistics (Venn style)
    • Venn diagrams

Features (3/3)

  • XML Output
    • Master data
    • Base output + statistics, metadata
    • Can be repurposed
  • HTML output
    • Dynamic, filterable
    • Statistical tables
    • Venn diagrams
    • Normalizations can be reverted
    • Truncation possible

Possible uses

  • Compare multiple versions of an article, book chapter
  • Determine OCR accuracy
  • Detect manuscript relationships
  • Workflow analysis

Sample output

Two-way diff

Three-way diff

Adjustments

<test>
   <s attribute="a">Philosophy gives no pictures of reality.</s>
</test>
<test>
   <s attribute="x" another="test"> philosophy gives nö
      <b/> pictures of reality </s>
</test>

Other output

(time & interest permitting)

Aporia

Weighting scores?

Perhaps easy to express (alteration pairs + score modifier)
Hard to implement
  • Finding diffs in diffs
  • Ambiguity
  • False positives

Comparing, scoring tree differences?

Hard to express
Impossible to implement without some expression mechanism

Running TAN Diff+

(time & interest permitting)

Setup

  • Download TAN
  • Install Oxygen (optional)
  • Create a folder with (only) the files you want to compare (docx, plain text, or XML)
  • Read documentation

Option 1: Oxygen

  1. Open TAN.xpr in Oxygen
  2. Open applications/Diff+/Diff+.xsl
  3. Click Configure Transformation Scenarios (ctrl+shift+T)
  4. Pick and apply APP: Diff+; when prompted, supply the input directory and filename filters
  5. Follow prompts (Oxygen GUI needs help)
  6. Read messages / open output

Option 2: No GUI

  1. Make sure Java is installed (install it if necessary)
  2. Download Saxon HE (any version)
  3. Rename the Saxon jar file to saxon.jar and move it into TAN's processors folder
  4. Windows users: drag files or folders onto Diff+.bat
  5. Mac users: open a terminal and build a command-line invocation of the Saxon processor (a sketch follows)
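A hedged sketch of one such invocation, using the folder names mentioned above; -xsl, -it, and -o are standard Saxon options, but whether Diff+.xsl starts from an initial template or from a source document, and which stylesheet parameters it expects, should be checked against the documentation:

   java -jar processors/saxon.jar -xsl:applications/Diff+/Diff+.xsl -it -o:diff-output.xml

Any stylesheet parameters are appended as name=value pairs; the parameter names themselves are defined in Diff+.xsl.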

Option 3: XSLT development

  1. Include or import Diff+.xsl
  2. Start programming (a minimal skeleton is sketched below)
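A minimal skeleton under those assumptions; the import path is illustrative, and the tan: namespace binding shown here should be verified against the library:

   <xsl:stylesheet version="3.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:tan="tag:textalign.net,2015:ns">

      <!-- Illustrative path: point it at your local copy of Diff+.xsl. -->
      <xsl:import href="applications/Diff+/Diff+.xsl"/>

      <!-- From here the TAN function library (tan:diff(), tan:collate(), and friends)
           is available to your own templates and functions. -->

   </xsl:stylesheet>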

Future Work

TAN Diff+ To-Do

  • Tutorial
  • docx output (in prototype)
  • Show transpositions
  • Test suite
  • Increase parameterization

Two TAN Alternatives

Direct Development

Build your own XSLT application with the TAN function library
  • Immediate access to numerous original tools
  • Use a different scoring method, e.g., tan:levenshtein-distance()
  • Maintain differences, restore versions with tan:diff-to-delta() and tan:apply-deltas() (see the sketch after this list)
  • Detect transpositions
  • Do mass analysis (I did this on Wittgenstein subset 1)
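A hedged sketch of that round trip; the function names come from the library, but the arities and argument order shown here are assumptions to be checked against the function documentation:

   <!-- $version-1 and $version-2 hold two versions of the same text as strings. -->
   <xsl:variable name="diff" select="tan:diff($version-1, $version-2)"/>
   <!-- Reduce the full diff to a compact delta (assumed one-argument form)... -->
   <xsl:variable name="delta" select="tan:diff-to-delta($diff)"/>
   <!-- ...so that keeping $version-1 plus $delta is enough to restore $version-2 later
        (assumed argument order: base string, then deltas). -->
   <xsl:variable name="restored" select="tan:apply-deltas($version-1, $delta)"/>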

TAN Tangram

TAN Tangram

  • For dissimilar documents
  • Quotation detection, parallels, collocations, common themes

How it works

  • Token-based atoms ("token" defined as you like)
  • Tokens converted to one or more aliases
  • Aliases so far are lexemes, but could be anything
  • Focus on rare aliases; initially ignore common ones
  • Build 1-grams
  • Build 2-grams based on aura (the token-to-alias and 1-gram steps are sketched below)
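A hypothetical, self-contained sketch of those first steps (tokens, aliases, shared 1-grams). The aliasing function is a mere placeholder for the lexeme lookup Tangram actually uses, and nothing below is taken from the Tangram code itself:

   <xsl:stylesheet version="3.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      xmlns:local="urn:local:tangram-sketch">

      <xsl:output method="text"/>

      <!-- Placeholder aliasing: Tangram maps tokens to lexemes; here we merely lowercase. -->
      <xsl:function name="local:alias" as="xs:string">
         <xsl:param name="token" as="xs:string"/>
         <xsl:sequence select="lower-case($token)"/>
      </xsl:function>

      <xsl:template name="xsl:initial-template">
         <xsl:variable name="doc-a" select="'To be, or not to be, that is the question'"/>
         <xsl:variable name="doc-b" select="'The question is whether to be or not'"/>
         <!-- Token-based atoms: here, maximal runs of word characters. -->
         <xsl:variable name="aliases-a" select="tokenize($doc-a, '\W+')[. ne ''] ! local:alias(.)"/>
         <xsl:variable name="aliases-b" select="tokenize($doc-b, '\W+')[. ne ''] ! local:alias(.)"/>
         <!-- 1-grams: aliases the two documents share. A real run would first drop the most
              common aliases, then look for 2-grams within each remaining alias's aura. -->
         <xsl:variable name="shared" select="distinct-values($aliases-a[. = $aliases-b])"/>
         <xsl:value-of select="string-join($shared, ', ')"/>
      </xsl:template>
   </xsl:stylesheet>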

Sample output (so far)

Problems

  • Computationally intensive
  • Good alias sources hard to find
  • Elusive quotations: 1-grams ("Doh!"), lack of rare words ("To be or not to be.")
  • Noise

Opportunities

  • Already used to make original discoveries
  • Enormously beneficial for highly inflected languages
  • Not bound to word order, insertions, padding
  • No need to go beyond 3-grams (clustering principle)
  • In tests has outperformed TLG, e-TRACER

Our questions

What do we, in different contexts, mean by document similarity?
If identity is the extreme case of similarity, how do we define document or text identity?
  • In what properties are you interested?
  • In what properties are you not interested?
  • What are your atoms?
  • How are your atoms organized?
How do we make sure that the results of digital methods match up with our intuitions?
  • Onus on the creator of the digital method
  • Data model stage: document your choices
  • Input stage: parameterize! Compel your user to declare assumptions, preferences
  • Algorithm stage: declare, document your intentions, expectations
  • Output stage: report all choices, assumptions made
  • Test, revise
  • If an intuition cannot (yet) be modeled, write a TODO or warning
And if they don't, can we still learn something useful from the application of such methods?

Afterthoughts

I am available for hands-on configuration
Consider submitting an article to jTEI

Thank you

Joel Kalvesmaki
kalvesmaki@gmail.com / director@textalign.net