Mining, Measuring, and Managing Textual Variation with TAN Diff+

Joel Kalvesmaki
4 April 2022, University of Bergen

Follow along

textalign.net/bergen

Background

(handout)

Text Difference

  • Well-known problem
  • Standard algorithm
  • Longest common subsequence
  • Numerous software packages

High-demand text differencing

  • N-way; normalization; weighted scoring; statistical analysis; transpositions; text as structure; big data; pipelines
  • Requires building a solution via a function library
    • Python: difflib's ndiff
    • 8 languages: diff-match-patch
    • XSLT: TAN Diff+
    • Write your own

A Quick TAN

TAN

  • Text Alignment Network
  • http://textalign.net
  • To enhance the semantic and syntactic interoperability of texts and textual annotations
  • Unique, extensive XSLT function library
  • GNU General Public License

TAN Function Library

  • > 250 functions
  • Many areas: strings, numerals, regular expressions, maps, arrays, checksums, etc.
  • Extensible
  • Customisable

tan:diff()

  • Algorithm: staggered samples
  • Runs in logarithmic time
  • Optimization not needed
  • Very high quality output
  • Default: snap to word
  • Under the hood: character-by-character (a minimal call is sketched below)
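A minimal sketch of calling tan:diff(), reusing the sample sentences from the Adjustments slide below. It assumes a host stylesheet that already imports the TAN function library (see Option 3 under Running TAN Diff+) and that the two-argument form simply takes the two strings to compare:

   <!-- Assumes the TAN function library is imported and bound to the tan: prefix. -->
   <xsl:variable name="comparison"
      select="tan:diff('Philosophy gives no pictures of reality.',
                       'philosophy gives nö pictures of reality')"/>
   <!-- $comparison is an XML tree marking shared and divergent stretches of the two
        strings; downstream functions such as tan:diff-to-delta() consume this tree. -->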

TAN Difference Functions

  • tan:diff()
  • tan:collate()
  • tan:collate-pair-of-sequences()
  • tan:diff-to-delta()
  • tan:levenshtein-distance()
  • tan:replace-diff()
  • tan:replace-collation()
  • tan:get-diff-output-transpositions()

TAN Diff+

Features (1/3)

  • Wraps tan:diff() and friends for common needs
  • XSLT application
  • Heuristic tool (not an editor) for specialists in industry, academia
  • N-way comparisons
  • Plain text (XML) user interface with lots of annotation
  • Extensible, customisable (liberal license)

Features (2/3)

  • Input: docx, XML, plain text (or folders)
  • Preprocessing
    • Normalization (case, space, punctuation)
    • Custom text alteration
    • Custom structure alteration, filtration
    • Serialization
    • Special normalization for Greek, Latin, Syriac
  • Postprocessing
    • De-normalization
    • Statistics (Venn style)
    • Venn diagrams

Features (3/3)

  • XML Output
    • Master data
    • Base output + statistics, metadata
    • Can be repurposed
  • HTML output
    • Dynamic, filterable
    • Statistical tables
    • Venn diagrams
    • Normalizations can be reverted
    • Truncation possible

Possible uses

  • Compare multiple versions of an article, book chapter
  • Determine OCR accuracy
  • Detect manuscript relationships
  • Workflow analysis

Sample output

Two-way diff

Three-way diff

Adjustments

<test>
   <s attribute="a">Philosophy gives no pictures of reality.</s>
</test>
<test>
   <s attribute="x" another="test"> philosophy gives nö
      <b/> pictures of reality </s>
</test>

Other output

(time & interest permitting)

Aporia

Weighting scores?

Perhaps easy to express (alteration pairs + score modifier)
Hard to implement
  • Finding diffs in diffs
  • Ambiguity
  • False positives

Comparing, scoring tree differences?

Hard to express
Impossible to implement without some expression mechanism

Running TAN Diff+

(time & interest permitting)

Setup

  • Download TAN
  • Install Oxygen (optional)
  • Create a folder with (only) the files you want to compare (docx, plain text, or XML)
  • Read documentation

Option 1: Oxygen

  1. Open TAN.xpr in Oxygen
  2. Open applications/Diff+/Diff+.xsl
  3. Click Configure Transformation Scenarios (ctrl+shift+T)
  4. Pick and apply APP: Diff+; when prompted, supply the input directory and filename filters
  5. Follow prompts (Oxygen GUI needs help)
  6. Read messages / open output

Option 2: No GUI

  1. Make sure Java is installed (install it if necessary)
  2. Download Saxon HE (any version)
  3. Rename the Saxon jar file to saxon.jar and move it into TAN's processors folder
  4. Windows users: drag files or folders onto Diff+.bat
  5. Mac users: open a terminal and build a command-line invocation of the Saxon processor (a sketch follows)
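A hedged sketch of one such invocation, using the folder names mentioned above; -xsl, -it, and -o are standard Saxon options, but whether Diff+.xsl starts from an initial template or from a source document, and which stylesheet parameters it expects, should be checked against the documentation:

   java -jar processors/saxon.jar -xsl:applications/Diff+/Diff+.xsl -it -o:diff-output.xml

Any stylesheet parameters are appended as name=value pairs; the parameter names themselves are defined in Diff+.xsl.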

Option 3: XSLT development

  1. Include or import Diff+.xsl
  2. Start programming (a minimal skeleton is sketched below)
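A minimal skeleton under those assumptions; the import path is illustrative, and the tan: namespace binding shown here should be verified against the library:

   <xsl:stylesheet version="3.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:tan="tag:textalign.net,2015:ns">

      <!-- Illustrative path: point it at your local copy of Diff+.xsl. -->
      <xsl:import href="applications/Diff+/Diff+.xsl"/>

      <!-- From here the TAN function library (tan:diff(), tan:collate(), and friends)
           is available to your own templates and functions. -->

   </xsl:stylesheet>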

Future Work

TAN Diff+ To-Do

  • Tutorial
  • docx output (in prototype)
  • Show transpositions
  • Test suite
  • Increase parameterization

Two TAN Alternatives

Direct Development

Build your own XSLT application with the TAN function library
  • Immediate access to numerous original tools
  • Use a different scoring method, e.g., tan:levenshtein-distance()
  • Maintain differences, restore versions with tan:diff-to-delta() and tan:apply-deltas() (see the sketch after this list)
  • Detect transpositions
  • Do mass analysis (I did this on Wittgenstein subset 1)
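A hedged sketch of that round trip; the function names come from the library, but the arities and argument order shown here are assumptions to be checked against the function documentation:

   <!-- $version-1 and $version-2 hold two versions of the same text as strings. -->
   <xsl:variable name="diff" select="tan:diff($version-1, $version-2)"/>
   <!-- Reduce the full diff to a compact delta (assumed one-argument form)... -->
   <xsl:variable name="delta" select="tan:diff-to-delta($diff)"/>
   <!-- ...so that keeping $version-1 plus $delta is enough to restore $version-2 later
        (assumed argument order: base string, then deltas). -->
   <xsl:variable name="restored" select="tan:apply-deltas($version-1, $delta)"/>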

TAN Tangram

TAN Tangram

  • For dissimilar documents
  • Quotation detection, parallels, collocations, common themes

How it works

  • Token-based atoms ("token" defined as you like)
  • Tokens converted to one or more aliases
  • Aliases so far are lexemes, but could be anything
  • Focus on rare aliases; initially ignore common ones
  • Build 1-grams
  • Build 2-grams based on aura (the token-to-alias and 1-gram steps are sketched below)
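A hypothetical, self-contained sketch of those first steps (tokens, aliases, shared 1-grams). The aliasing function is a mere placeholder for the lexeme lookup Tangram actually uses, and nothing below is taken from the Tangram code itself:

   <xsl:stylesheet version="3.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      xmlns:local="urn:local:tangram-sketch">

      <xsl:output method="text"/>

      <!-- Placeholder aliasing: Tangram maps tokens to lexemes; here we merely lowercase. -->
      <xsl:function name="local:alias" as="xs:string">
         <xsl:param name="token" as="xs:string"/>
         <xsl:sequence select="lower-case($token)"/>
      </xsl:function>

      <xsl:template name="xsl:initial-template">
         <xsl:variable name="doc-a" select="'To be, or not to be, that is the question'"/>
         <xsl:variable name="doc-b" select="'The question is whether to be or not'"/>
         <!-- Token-based atoms: here, maximal runs of word characters. -->
         <xsl:variable name="aliases-a" select="tokenize($doc-a, '\W+')[. ne ''] ! local:alias(.)"/>
         <xsl:variable name="aliases-b" select="tokenize($doc-b, '\W+')[. ne ''] ! local:alias(.)"/>
         <!-- 1-grams: aliases the two documents share. A real run would first drop the most
              common aliases, then look for 2-grams within each remaining alias's aura. -->
         <xsl:variable name="shared" select="distinct-values($aliases-a[. = $aliases-b])"/>
         <xsl:value-of select="string-join($shared, ', ')"/>
      </xsl:template>
   </xsl:stylesheet>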

Sample output (so far)

Problems

  • Computationally intensive
  • Good alias sources hard to find
  • Elusive quotations: 1-grams ("Doh!"), lack of rare words ("To be or not to be.")
  • Noise

Opportunities

  • Already used to make original discoveries
  • Enormously beneficial for highly inflected languages
  • Not bound to word order, insertions, padding
  • No need to go beyond 3-grams (clustering principle)
  • In tests has outperformed TLG, e-TRACER

Our questions

What do we, in different contexts, mean by document similarity?
If identity is the extreme case of similarity, how do we define document or text identity?
  • In what properties are you interested?
  • In what properties are you not interested?
  • What are your atoms?
  • How are your atoms organized?
How do we make sure that the results of digital methods match up with our intuitions?
  • Onus on the creator of the digital method
  • Data model stage: document your choices
  • Input stage: parameterize! Compel your user to declare assumptions, preferences
  • Algorithm stage: declare, document your intentions, expectations
  • Output stage: report all choices, assumptions made
  • Test, revise
  • If an intuition cannot (yet) be modeled, write a TODO or warning
And if they don't, can we still learn something useful from the application of such methods?

Afterthoughts

I am available for hands-on configuration
Consider submitting an article to jTEI

Thank you

Joel Kalvesmaki
kalvesmaki@gmail.com / director@textalign.net