An Experiment in Stand-Off Linked Open Linguistic Data for TEI

The Text Alignment Network's TAN-A-lm and TAN-mor Formats

Joel Kalvesmaki
Tuesday 6 September 2023

textalign.net | kalvesmaki.com | evagriusponticus.net
email: kalvesmaki@gmail.com

Follow along

textalign.net/tei2023

About me

School/Employment

  • 1999: BA, Classics, Philosophy, University of Washington
  • 2002, 2006: MA, PhD, Early Christian Studies, Catholic University of America
  • 2004-2006, 2008-2019: editor, Dumbarton Oaks
  • 2019-present: software developer, Government Publishing Office

Research interests

  • Evagrius Ponticus
  • Number symbolism
  • Greek scholia
  • Digital humanities

Positions

  • Fellow, Catholic University of America
  • Editor, Journal of the Text Encoding Initiative
  • Editor, Christianity in Late Antiquity
  • Editor, Syriac Treasures
  • Member, XPath/XQuery/XSLT 4.0 community group
  • Advisor
    • Digital Classicist Wiki
    • Chrysostom Latinus in Iohannem Online
    • Corpus Coranicum Christianorum

Caveats and assumptions

Caveats

  • By training I am a scholar, but by profession I am a programmer, principally in XSLT and C#.
  • I’m not a linguist. My understanding of linguistics is shaped by Greek, Latin, English, Syriac, Coptic.
  • In participating in LingSIG, I want to support (not determine) the direction linguists take.

Assumptions

  • You know what TEI (and therefore XML) is.
  • You know what linked open data/RDF is.
  • You know what lexico-morphology is.
  • You know what a highly inflected language looks like.

Current TEI

Simple inline annotations

TEI Guidelines
<p>
 <w pos="PNP">We</w>
 <w pos="VBB" join="left">'re</w>
 <w pos="VVG">going</w>
 <w pos="PRP">on</w>
 <w pos="NN1">vacation</w>
 <pc pos="PUN" join="left">.</pc>
</p>
                        
Also available: @msd, @lemma

Stand-off annotations

(I couldn’t find a succinct example in time.)
  • Use of ids in the text
  • Linguistic annotations are elsewhere in the same document
  • Annotations point to the text via id refs
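In place of a real TEI example, here is a minimal sketch of the pattern just described: tokens carry ids, annotations sit elsewhere in the same document, and each annotation points back by id reference. The element names `doc`, `ann`, and `standOff`, and the attributes `target` and `pos` as used here, are illustrative assumptions, not a claim about what any given TEI project does.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predeclared XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical document: the markup below is illustrative only.
DOC = """
<doc>
  <text>
    <w xml:id="w1">We</w>
    <w xml:id="w2">'re</w>
    <w xml:id="w3">going</w>
  </text>
  <standOff>
    <ann target="#w1" pos="PNP"/>
    <ann target="#w2" pos="VBB"/>
    <ann target="#w3" pos="VVG"/>
  </standOff>
</doc>
"""

def resolve_annotations(xml_text):
    """Pair each stand-off annotation with the token its id ref points to."""
    root = ET.fromstring(xml_text)
    tokens = {w.get(XML_ID): w.text for w in root.iter("w")}
    pairs = []
    for ann in root.iter("ann"):
        token_id = ann.get("target").lstrip("#")
        pairs.append((tokens[token_id], ann.get("pos")))
    return pairs

pairs = resolve_annotations(DOC)
print(pairs)
```

The text stays clean, but every annotation depends on ids that exist only inside this one document.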

Some issues

  • Not quite LOD-ready (e.g., code “PNP” = what IRI?)
  • Hidden assumptions about the given tagging scheme or language (e.g., “because” = conj. or prep.?)
  • Different projects mean different language models and codes
  • How do I encode ambiguity, or alternatives?
    • “Get down off a duck” (“down” = adverb or noun)
    • “ἀλλὰ ῥῦσαι ἡμᾶς ἀπὸ τοῦ πονηροῦ.” = “deliver us from evil” or “...evil one”? (masc. or neut.)

Some other TEI issues

  • Laissez-faire
  • Pointers, syntax are project-oriented
  • teiHeader can be rambling, confusing, not LOD-friendly.
  • Encourages repetition, redundancy
  • Publishing aside, applications and tools are DIY, project-specific
  • etc.

Current non-TEI

    <row>
        <lexid>507350</lexid>
        <token>πονηροῦ</token>
        <code>a--s---mg-</code>
        <lemma>πονηρός</lemma>
        <alt_lsj/>
        <note/>
        <blesslemma>1</blesslemma>
        <blesslex/>
    </row>
    <row>
        <lexid>507351</lexid>
        <token>πονηροῦ</token>
        <code>a--s---ng-</code>
        <lemma>πονηρός</lemma>
        <alt_lsj/>
        <note/>
        <blesslemma>1</blesslemma>
        <blesslex>1</blesslex>
    </row>

Computer Assisted Tools for Septuagint/Scriptural Study

                                
    Gen 2:9
    . . . .
    PONHROU=                 A1A GSM    PONHRO/S

MorphGNT SBLGNT

                                
    010613 RA ----GSN- τοῦ τοῦ τοῦ ὁ
    010613 A- ----GSN- ⸀πονηροῦ. πονηροῦ πονηροῦ πονηρός

Some issues

  • Different syntax (a--s---ng- vs A1A GSN vs A- ----GSN-)
  • Some differences in morphological taxonomies (vocative deprecated)
  • Not LOD/RDF-ready (but could be)
  • Slightly dirty data
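To make the syntax differences concrete, here is a rough sketch that decodes two of the positional codes shown above into one shared set of features. The slot positions are assumptions read off the samples (number, gender, and case at indexes 3, 7, and 8 of the ten-character code; case, number, and gender at indexes 4, 5, and 6 of the eight-character MorphGNT parsing code), and the names `TEN_SLOT`, `MORPHGNT`, and `decode` are hypothetical.

```python
# Slot positions are assumptions inferred from the sample codes above.
TEN_SLOT = {3: "number", 7: "gender", 8: "case"}   # e.g. a--s---ng-
MORPHGNT = {4: "case", 5: "number", 6: "gender"}   # e.g. ----GSN-

# Only the values needed for the samples are listed.
VALUES = {
    "number": {"s": "singular", "p": "plural"},
    "gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
    "case": {"n": "nominative", "g": "genitive", "d": "dative", "a": "accusative"},
}

def decode(code, layout):
    """Turn a positional code into a feature dictionary, skipping empty slots."""
    features = {}
    for index, feature in layout.items():
        char = code[index].lower()
        if char != "-":
            features[feature] = VALUES[feature][char]
    return features

neuter_a = decode("a--s---ng-", TEN_SLOT)
neuter_b = decode("----GSN-", MORPHGNT)
masculine = decode("a--s---mg-", TEN_SLOT)

print(neuter_a == neuter_b)   # the two syntaxes agree once decoded
print(masculine["gender"])
```

Once decoded, the two neuter analyses are the same claim in different notation, which is why the data could be made LOD-ready.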

Principle-based desiderata

  • A linguistic annotation is a claim, not a fact.
  • A claimant (person/algorithm) makes a linguistic claim about some text at a particular time, with some level of confidence, using a syntax governed by rules determined by a linguistic perspective.
  • Each factor named above (claimant, text, time, confidence, syntax, rules, perspective) deserves support in our TEI files.
  • The customary RDF triple is inadequate for this (RDF* perhaps is not).
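One way to picture the desiderata is as a record that carries more than subject, predicate, and object. The sketch below is only a thought model, not a TAN or TEI structure. The class `LinguisticClaim`, its field names, the claimant names, the dates, and the confidence values are all hypothetical. The token and codes come from the πονηροῦ example.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class LinguisticClaim:
    claimant: str      # person or algorithm making the claim
    text_ref: str      # the passage the claim is about
    token: str
    analysis: str      # a code written in some syntax
    syntax: str        # the rule set that governs the code
    made_on: date
    confidence: float  # 0.0 (no confidence) to 1.0 (certain)

# Hypothetical claimants, dates, and confidence values.
claims = [
    LinguisticClaim("annotator-a", "Matt 6:13", "πονηροῦ", "a--s---mg-",
                    "ten-slot", date(2023, 9, 1), 0.6),
    LinguisticClaim("annotator-b", "Matt 6:13", "πονηροῦ", "a--s---ng-",
                    "ten-slot", date(2023, 9, 2), 0.4),
]

def alternatives(all_claims, token):
    """Collect every competing analysis offered for a token."""
    return sorted({c.analysis for c in all_claims if c.token == token})

options = alternatives(claims, "πονηροῦ")
preferred = max(claims, key=lambda c: c.confidence)
print(options)
print(preferred.claimant)
```

Both readings survive side by side, and a consumer can still rank them without discarding the weaker one.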

A vision: TAN

  • Constrained, decluttered TEI
  • Family of XML formats (only one task per format)
  • Annotations → simple, networked stand-off files
  • Annotations linked to text via citation conventions, not @id
  • Cross-project communication, interoperability
  • Streamlined metadata, rooted in semantics, accessed by familiar names, abbreviations
  • Deeper validation
  • New inclusion techniques
  • So much more

Vocabulary: TAN-voc

Examples

Features

  • Many <IRI>s encouraged: synonymity principle.
  • The <name> serves as an id for any linking file.
  • Any given file can add another, usually terser, id that is valid only locally.
  • Every id/idref implicitly invokes the IRIs of the underlying concept.
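The features above can be pictured with a small lookup. This is not TAN-voc syntax. It only illustrates the idea that one concept may have several synonymous IRIs, that its name acts as an id, and that a file may add a terser local alias. The IRIs (on example.org and example.com), the name `genitive`, and the alias `g` are made up for illustration.

```python
# Hypothetical vocabulary: one name, several synonymous IRIs.
VOCABULARY = {
    "genitive": frozenset({
        "http://example.org/vocab/genitive",
        "http://example.com/grammar/case#gen",
    }),
}

# Hypothetical terser id, valid only within one linking file.
LOCAL_ALIASES = {"g": "genitive"}

def resolve(idref):
    """Return every IRI behind an id, whether it is a name or a local alias."""
    name = LOCAL_ALIASES.get(idref, idref)
    return VOCABULARY[name]

print(resolve("g") == resolve("genitive"))
```

Whichever id a file uses, the same set of IRIs comes back, so two projects with different abbreviations can still be matched on a shared IRI.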

Lexicomorphological rules: TAN-mor

Examples

Features

  • Choose the syntax you like, either positional or non-positional.
  • Declare rules for the usage of the syntax, with different levels of warning.
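As a rough picture of those two features, the sketch below parses a positional and a non-positional code into the same features, then applies rules that carry different warning levels. It is not TAN-mor syntax. The four-slot layout, the keywords, and both rules are hypothetical.

```python
# Hypothetical four-slot positional syntax: part of speech, number, gender, case.
SLOTS = ["pos", "number", "gender", "case"]

# Hypothetical non-positional syntax: order-free keywords.
KEYWORDS = {
    "adj": ("pos", "a"),
    "conj": ("pos", "c"),
    "sg": ("number", "s"),
    "masc": ("gender", "m"),
    "gen": ("case", "g"),
}

def parse_positional(code):
    """Read one character per slot, skipping '-' placeholders."""
    return {slot: char for slot, char in zip(SLOTS, code) if char != "-"}

def parse_non_positional(code):
    """Read space-separated keywords in any order."""
    return dict(KEYWORDS[word] for word in code.split())

# Hypothetical rules: (level, message, test that is True when the rule is broken).
RULES = [
    ("warning", "adjective lacks a case",
     lambda f: f.get("pos") == "a" and "case" not in f),
    ("error", "conjunction carries a gender",
     lambda f: f.get("pos") == "c" and "gender" in f),
]

def check(features):
    """List the level and message of every rule the features break."""
    return [(level, message) for level, message, broken in RULES if broken(features)]

same = parse_positional("asmg") == parse_non_positional("gen adj masc sg")
print(same)
print(check(parse_positional("as--")))
print(check(parse_positional("c-m-")))
```

Separating the syntax from the rules means an editor can pick whichever notation is comfortable and still get the same validation.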

Linguistic annotations: TAN-A-lm

Examples: text-specific

Examples: language-specific

  • Perseus Greek (1,787 files)
  • Perseus Latin (2,138 files)

Examples: text distilled to language-specific

Features

  • Fully resolvable to LOD
  • Responsive to TAN-mor rules
  • Schematron Quick Fixes allow handy edits
  • Duplication can be reduced

Putting it together

TEI + (TAN-voc + TAN-mor + TAN-A-lm)

Rich lexicomorphological data is valuable

(Why go to all this trouble?)

Pros and cons

Cons

* = I think I could improve on this, if I had the time and impetus
  • *Time-consuming and memory-intensive to create
  • *Time-consuming to edit
  • *Validating long, complex texts or annotations can be processor- and memory-intensive
  • Not always convenient (sometimes standoff data is a headache)
  • *Vocabulary hard to traverse in functions

Pros

  • Tiny errors can be found and fixed
  • Greater cross-project interoperability is possible
  • RDF- and RDF*-conversant
  • A model for the future
  • For applications, the sky is the limit
  • Access to one of the largest XSLT function libraries

Status

  • 2021 version stable
  • 2021 EADH workshop
  • I am taking time off to recharge
  • Liberal license (steal my code, please)

An idea

  • LingSIG community Schematron file(s)
  • Validates files
  • Provides editorial assistance (Schematron Quick Fixes)
  • Core library behind validation can be repurposed for applications.

Thank you


email: kalvesmaki@gmail.com | director@textalign.net