The Text Alignment Network: Official Guidelines

Revision History
Revision 1 dev, 2017-05-24

Working draft. Please send corrections to the author (see above).

Formats: HTML, PDF, Docbook (master)

Warning

Where these guidelines contradict, or appear to contradict, the core TAN files, priority should be given first to the RELAX NG schemas (compact syntax), then to the functions, and only then to these guidelines.

Chapters 1-7 and 10 are written by hand and are relatively accurate. Chapters 8, 9, 11, and 12 are generated by an algorithm that selectively reformats normative TAN files. Errors or inconsistencies found in the generated chapters will be due either to the XSLT stylesheets that produce them or to the files upon which they are based.


Table of Contents

I. General Overview
1. Introduction
Definition and purpose
Rationale and Purpose
Design Principles
Participation
2. Starting off with the TAN Format
Creating TAN Transcription and Alignment Data
The Principles of TAN Metadata (<head>)
Creating TAN Metadata (<head>)
Aligning across Projects
II. Detailed Description
3. General Underpinnings
The Big Picture
Assumptions in the Creation of TAN Data
Core Technology
Unicode
eXtensible Markup Language (XML)
Namespaces
The Text Encoding Initiative
Data types
Identifiers and Their Use
Regular Expressions
Interpretation of multiple values
4. Patterns and Structures Common to All TAN Encoding Formats
Common Patterns
IRI + name Pattern
Digital Entity Metadata Pattern
Edit Stamp
Overall Structure (root)
@id and a TAN file's IRI Name
Metadata (<head>)
Rights and Licenses
Inclusions and Keys
Distinguishing <source>s and <see-also>s
Interpretation of inheritable attributes
Defining Words and Tokens
5. Class-1 TAN Files, Representations of Textual Objects (Scripta)
Principles and Assumptions
General
Domain model
One version, one work, one object, one reference system
Normalizing transcriptions
Transcriptions
Flattened References, and the Leaf Div Uniqueness Rule
Transcriptions Using the Text Encoding Initiative (<TEI>)
6. Class-2 TAN Files, Annotations of Texts
Common Elements
Class 2 Validation
Class 2 Metadata (<head>)
Class 2 Data Patterns (<body>)
@pos and @val
Alignments: Principles and Assumptions
Division-Based Alignments (<TAN-A-div>)
Root Element and Header
Data (<body>)
Token-Based Alignments (<TAN-A-tok>)
Root Element and Header
Data (<body>)
Lexico-Morphology
Principles and Assumptions
Root Element and Header
Data (<body>)
7. Class-3 TAN Files, Varia
Keyword Vocabulary (TAN-key)
Root Element and Head
Data (<body>)
Morphological Concepts and Patterns (TAN-mor)
Principles and Assumptions
Root Element and Header
Data (<body>)
Claims and assertions (TAN-c)
Root Element and Header
Data (<body>)
8. TAN patterns, elements, and attributes defined
TAN-core elements and attributes summarized
<agent>
<agentrole>
<alias>
<body>
<change>
<checksum>
<comment>
<declarations>
<desc>
<for-lang>
<group>
<group-type>
<head>
<inclusion>
<IRI>
<key>
<location>
<master-location>
<name>
<relationship>
<rights-excluding-sources>
<rights-source-only>
<role>
<see-also>
<source>
<tail>
<token-definition>
<value>
<version>
<when>
<work>
@affects-element
@cert
@cert2
@ed-when
@ed-who
@flags
@from
@group
@help
@href
@id
@idrefs
@in-progress
@include
@n
@regex
@rights-holder
@roles
@TAN-version
@to
@type
@when
@when-accessed
@which
@who
@xml:id
@xml:lang
TAN-class-1 elements and attributes summarized
<div-type>
<filter>
<normalization>
<replace>
<transliteration>
@replacement
TAN-T elements and attributes summarized
<div>
<TAN-T>
TAN-class-2 elements and attributes summarized
<rename>
<rename-div-ns>
<suppress-div-types>
<tok>
@chars
@cont
@div-type-ref
@new
@old
@pos
@ref
@src
@val
TAN-A-div elements and attributes summarized
<anchor-div-ref>
<div-ref>
<div-type-ref>
<equate-div-types>
<equate-works>
<realign>
<split-leaf-div-at>
<TAN-A-div>
@seg
@work
TAN-A-tok elements and attributes summarized
<align>
<bitext-relation>
<reuse-type>
<TAN-A-tok>
@bitext-relation
@reuse-type
TAN-LM-core elements and attributes summarized
<ana>
<l>
<lexicon>
<lm>
<m>
<morphology>
<TAN-LM>
@def-ref
@lexicon
@morphology
TAN-LM elements and attributes summarized
TAN-LM-lang elements and attributes summarized
TAN-class-3 elements and attributes summarized
TAN-key elements and attributes summarized
<item>
<TAN-key>
TAN-mor elements and attributes summarized
<assert>
<category>
<feature>
<report>
<TAN-mor>
@code
@context
@feature-qty-test
@feature-test
@matches-m
@matches-tok
TAN-c elements and attributes summarized
<TAN-c>
TAN-c-core elements and attributes summarized
<claim>
<claim-basis>
<locus>
<modal>
<object>
<person>
<place>
<scriptum>
<subject>
<topic>
<unit>
<verb>
@adverb
@claim-basis
@claimant
@object
@object-datatype
@object-lexical-constraint
@subject
@units
@verb
@where
TAN patterns
~agent-list
~agent-ref
~agent-role-list
~alignment
~alignment-attributes-non-class-2
~alignment-content-non-class-2
~alignment-inclusion-opt
~anchor-div-ref-item
~any-attribute
~any-content
~any-element
~assert
~attr-cert
~attr-cert2
~bitext-relation-attr
~body-group
~body-group-opt
~category
~category-feature
~category-list
~cert-claim
~cert-content
~cert-opt
~certainty-stamp
~change-list
~char-ref
~checksum
~claim
~claim-div-ref-item
~claimant
~code
~comment
~complex-object
~complex-rationale
~complex-subject
~complex-text-ref
~complex-textual-reference-set
~continuation
~continuation-opt
~decl-alias
~decl-brel
~decl-class-1
~decl-div
~decl-filt
~decl-filt-norm
~decl-filt-repl
~decl-filt-tlit
~decl-filter-content
~decl-group-type
~decl-id-ref-opt
~decl-lexi
~decl-mode
~decl-morph
~decl-non-class-1
~decl-opt
~decl-pattern-default
~decl-pattern-language
~decl-pattern-no-id
~decl-pers
~decl-place
~decl-rename-div-n
~decl-reus
~decl-scri
~decl-supp-div-type
~decl-tok-def
~decl-topic
~decl-unit
~decl-verb
~decl-vers
~decl-work
~declaration-core
~declaration-items
~div-item-ref
~div-range-ref
~div-type-equiv
~div-type-ref
~div-type-ref-cluster
~ed-agent
~ed-stamp
~ed-time
~element-scope
~entity-digital-generic-ref
~entity-digital-tan-other-ref
~entity-digital-tan-self-ref
~entity-nondigital-ref
~entity-tok-def
~error-flag
~feature
~feature-list
~feature-pattern
~feature-pattern-no-code
~feature-qty-test
~feature-test
~filter
~func-param-flags
~func-param-pattern
~func-replace
~grammar-attr
~group-attributes
~group-ref
~help-opt
~href-opt
~id-option
~inclusion
~inclusion-att
~inclusion-item
~inclusion-list
~internal-id
~internal-idrefs
~IRI-gen
~IRI-gen-ref
~item
~item-picker
~item-pos-ref
~key-item
~key-list
~keyword-ref
~lang-of-content
~lang-outside
~lexeme
~lexicon-attr
~loc-self
~loc-src
~locus
~matches-m
~matches-tok
~metadata-desc
~metadata-human
~modal-claim
~morph
~n
~n-val
~name-change
~non-class-2-opt
~nonsource-rights
~nontextual-reference
~object
~object-constraint
~object-datatype
~object-element
~object-lexical-constraint
~other-body-attributes
~period-filter
~place-filter
~pointer-to-div-item
~pointer-to-div-range
~progress
~rationale
~realignment
~reanchor-div-ref-item
~relationship
~report
~reuse-type-attr
~rights-holder
~role-list
~role-ref
~see-also-item
~see-also-list
~seg-ref
~seq-picker
~seq-pos-ref
~set-of-claims
~simple-object
~simple-rationale
~simple-subject
~simple-textual-reference
~source-id-opt
~source-item
~source-list
~source-ref
~source-refs
~source-rights
~split
~subject
~TAN-body
~TAN-body-core
~TAN-c-decl
~TAN-c-decl-core
~TAN-c-item
~TAN-head
~TAN-key-decl
~TAN-key-item
~TAN-LM-item
~TAN-R-mor-body
~TAN-root
~TAN-tail
~TAN-ver
~test-pattern
~text-div
~textual-reference
~tok-attr-core
~tok-cert-opt
~tok-regular
~tok-sequence
~tok-sequence-attr-core
~tok-source-ref-opt
~tok-with-cont-but-no-src
~tok-with-src-and-cont
~tok-without-cont-or-src
~token-value-ref
~type
~units
~URI-tag
~verb
~when-claim
~work-equiv
~work-ref
~work-refs
9. Official TAN keywords
TAN keywords for types of bitext relations (<bitext-relation>)
TAN keywords for types of divisions (<div-type>)
TAN keywords for features (<feature>)
TAN keywords for types of groups (<group-type>)
TAN keywords for types of modals (<modal>)
TAN keywords for types of normalizations (<normalization>)
TAN keywords for types of relationships (<relationship>)
TAN keywords for types of bitext reuse (<reuse-type>)
TAN keywords for types of rights (<rights-excluding-sources>, <rights-source-only>)
TAN keywords for types of roles (<role>)
TAN keywords for types of token definitions (<token-definition>)
TAN keywords for verbs (<verb>)
III. Working with the Text Alignment Network
10. Best Practices in Working with TAN Files
Local Setup
Creating and maintaining TAN collections
Creating and editing TAN files
Sharing TAN files
Doing Things with TAN Files (Stylesheets and the Function Library)
11. TAN variables, keys, functions, and templates
TAN-core global variables, keys, and functions summarized
variables
keys
functions
TAN-core-errors global variables, keys, and functions summarized
variables
functions
TAN-class-1 global variables, keys, and functions summarized
variables
functions
TAN-class-2 global variables, keys, and functions summarized
variables
keys
functions
templates
TAN-A-div global variables, keys, and functions summarized
variables
functions
TAN-A-tok global variables, keys, and functions summarized
functions
TAN-LM global variables, keys, and functions summarized
variables
functions
TAN-class-2-errors global variables, keys, and functions summarized
functions
TAN-class-1-and-2 global variables, keys, and functions summarized
variables
functions
TAN-key global variables, keys, and functions summarized
variables
TAN-class-2-and-3 global variables, keys, and functions summarized
functions
diff-for-xslt2 global variables, keys, and functions summarized
functions
regex-ext-tan global variables, keys, and functions summarized
variables
functions
templates
TAN-schema global variables, keys, and functions summarized
variables
functions
Mode templates
ŧ #all
ŧ add-lm-to-tok
ŧ add-square-brackets
ŧ add-tok-val
ŧ analysis-stamp
ŧ analyze-ref
ŧ arabic-numerals
ŧ c1-add-ref
ŧ c1-stamp-string-length
ŧ c1-stamp-string-pos
ŧ char-setup
ŧ class-1-copy-errors
ŧ class-1-errors
ŧ class-2-errors
ŧ compare-copies
ŧ convert-code-to-features
ŧ copy-of-except
ŧ core-attribute-errors
ŧ core-errors
ŧ count-tokenized-class-1
ŧ count-tokens
ŧ cull-prepped-class-1
ŧ diff-rectify
ŧ drop-tokenization
ŧ expand-lm
ŧ first-stamp
ŧ fragment-to-text
ŧ get-div-hierarchy-fragment
ŧ get-mismatched-text
ŧ include
ŧ infuse-tokenized-div
ŧ infuse-tokenized-text
ŧ insert-seg-into-leaf-divs-in-hierarchy-fragment
ŧ mark-splits
ŧ mark-splits-in-fragment
ŧ mark-tok-chars
ŧ normalize-space
ŧ pick-prepped-class-1
ŧ pluck
ŧ prep-class-1
ŧ prep-class-2-doc-pass-1
ŧ prep-class-2-doc-pass-2
ŧ prep-class-2-doc-pass-3
ŧ prep-class-2-doc-pass-3-old
ŧ prep-class-2-doc-pass-4
ŧ prep-rim-pass-1
ŧ prep-rim-pass-2
ŧ prep-srcs-verbosely
ŧ prep-tan-a-div-pass-3-prelim
ŧ prep-tan-a-div-pass-a
ŧ prep-tan-a-div-pass-b
ŧ prep-tan-claims
ŧ prep-tan-key
ŧ prep-tan-lm
ŧ prep-tan-mor
ŧ prep-verbosely
ŧ prepare-class-1-doc-for-merge
ŧ prepend-id-or-idrefs
ŧ process-splits
ŧ realign-tan-a-div-sources
ŧ referenced-doc-errors
ŧ resolve-attr-include
ŧ resolve-href
ŧ resolve-keyword
ŧ segment-tokd-prepped-class-1
ŧ snap-to-word-pass-1
ŧ split-marked-fragment
ŧ stamp-element-id
ŧ strip-all-attributes-except
ŧ strip-duplicates
ŧ strip-specific-attributes
ŧ strip-text
ŧ synthesize-merged-sources
ŧ TAN-A-div-errors
ŧ tan-a-div-merge-pass1
ŧ tan-key-errors
ŧ tokenize-prepped-class-1
ŧ unconsolidate-anas
Cross-format global variables
$self-and-sources-prepped
$self-prepped
$sources-prepped
Cross-format functions
tan:prep-resolved-class-1-doc()
12. Errors
error[adv01]
error[adv02]
error[adv03]
error[ali01]
error[cl101]
error[cl102]
error[cl103]
error[cl104]
error[cl105]
error[cl106]
warning[cl107]
error[cl108]
error[cl109]
error[cl110]
error[cl111]
error[cl112]
error[cl113]
error[cl114]
warning[cl115]
warning[cl116]
error[cl117]
fatal[cl201]
error[cl202]
error[cl203]
error[cl204]
error[cl208]
error[cl209]
warning[cl210]
error[cl211]
error[cl212]
error[clm01]
error[clm02]
error[clm03]
error[clm04]
error[clm05]
error[clm06]
error[clm07]
error[dst01]
error[dty01]
error[equ01]
error[inc01]
error[inc02]
error[inc03]
fatal[inc04]
error[inc05]
error[loc01]
error[loc02]
error[loc03]
error[rea01]
error[ref01]
error[ref02]
warning[ref03]
error[ref04]
error[see01]
error[see02]
error[see03]
error[see04]
error[seg01]
error[seq01]
error[seq02]
error[seq03]
error[spl01]
error[spl02]
error[spl03]
error[tan01]
warning[tan02]
error[tan03]
error[tan04]
error[tan05]
error[tan06]
error[tan07]
error[tan08]
error[tan09]
error[tan10]
error[tan11]
error[tan12]
error[tan13]
error[tan14]
error[tan15]
error[tan16]
error[tei01]
error[tei02]
error[tei03]
warning[tei04]
error[tei05]
error[tky01]
error[tky02]
error[tky03]
error[tky04]
error[tlm01]
error[tlm02]
error[tlm03]
error[tlm04]
warning[tlm05]
error[tmo01]
error[tmo02]
error[tmo03]
error[tok01]
error[tok02]
error[whe01]
error[whe02]
error[whe03]
error[whi01]
error[whi02]
error[whi03]
fatal[whi04]
warning[wrn01]
warning[wrn02]
warning[wrn03]
warning[wrn04]
warning[wrn05]

List of Figures

3.1. Venn diagram.jpeg

List of Tables

2.1. Ring around the Rosie
3.1. Unicode characters
3.2. Special characters in regular expressions
3.3. Examples of Regular Expressions
4.1. Root TAN elements
5.1. Synopsis of TAN-TEI customization
9.1. TAN keywords for types of bitext relations
9.2. TAN keywords for types of divisions
9.3. TAN keywords for features
9.4. TAN keywords for types of groups
9.5. TAN keywords for types of modals
9.6. TAN keywords for types of normalizations
9.7. TAN keywords for types of relationships
9.8. TAN keywords for types of bitext reuse
9.9. TAN keywords for types of rights
9.10. TAN keywords for types of roles
9.11. TAN keywords for types of token definitions
9.12. TAN keywords for verbs
10.1. Global variables for referred files

List of Examples

3.1. TAN IRI names
8.1. <agent>
8.2. <agent>
8.3. <agent>
8.4. <agent>
8.5. <alias>
8.6. <body>
8.7. <body>
8.8. <body>
8.9. <body>
8.10. <change>
8.11. <change>
8.12. <change>
8.13. <checksum>
8.14. <comment>
8.15. <comment>
8.16. <comment>
8.17. <comment>
8.18. <declarations>
8.19. <declarations>
8.20. <declarations>
8.21. <declarations>
8.22. <desc>
8.23. <desc>
8.24. <desc>
8.25. <for-lang>
8.26. <group>
8.27. <group-type>
8.28. <group-type>
8.29. <head>
8.30. <head>
8.31. <head>
8.32. <head>
8.33. <inclusion>
8.34. <IRI>
8.35. <key>
8.36. <location>
8.37. <location>
8.38. <master-location>
8.39. <master-location>
8.40. <master-location>
8.41. <master-location>
8.42. <name>
8.43. <relationship>
8.44. <relationship>
8.45. <rights-excluding-sources>
8.46. <rights-excluding-sources>
8.47. <rights-excluding-sources>
8.48. <rights-excluding-sources>
8.49. <role>
8.50. <role>
8.51. <role>
8.52. <role>
8.53. <see-also>
8.54. <see-also>
8.55. <source>
8.56. <source>
8.57. <source>
8.58. <source>
8.59. <token-definition>
8.60. <token-definition>
8.61. <token-definition>
8.62. <token-definition>
8.63. <value>
8.64. <version>
8.65. <work>
8.66. <work>
8.67. <work>
8.68. <work>
8.69. @affects-element
8.70. @affects-element
8.71. @affects-element
8.72. @affects-element
8.73. @cert
8.74. @cert
8.75. @ed-when
8.76. @ed-when
8.77. @ed-when
8.78. @ed-who
8.79. @ed-who
8.80. @ed-who
8.81. @group
8.82. @href
8.83. @href
8.84. @id
8.85. @id
8.86. @id
8.87. @id
8.88. @idrefs
8.89. @in-progress
8.90. @in-progress
8.91. @in-progress
8.92. @in-progress
8.93. @include
8.94. @n
8.95. @regex
8.96. @regex
8.97. @regex
8.98. @regex
8.99. @rights-holder
8.100. @rights-holder
8.101. @rights-holder
8.102. @rights-holder
8.103. @roles
8.104. @roles
8.105. @roles
8.106. @roles
8.107. @TAN-version
8.108. @TAN-version
8.109. @TAN-version
8.110. @TAN-version
8.111. @type
8.112. @when
8.113. @when
8.114. @when
8.115. @when-accessed
8.116. @when-accessed
8.117. @which
8.118. @which
8.119. @who
8.120. @who
8.121. @who
8.122. @xml:id
8.123. @xml:lang
8.124. @xml:lang
8.125. <div-type>
8.126. <div-type>
8.127. <filter>
8.128. <filter>
8.129. <filter>
8.130. <filter>
8.131. <normalization>
8.132. <normalization>
8.133. <normalization>
8.134. <normalization>
8.135. <div>
8.136. <TAN-T>
8.137. <TAN-T>
8.138. <TAN-T>
8.139. <TAN-T>
8.140. <rename>
8.141. <rename>
8.142. <rename-div-ns>
8.143. <rename-div-ns>
8.144. <suppress-div-types>
8.145. <suppress-div-types>
8.146. <tok>
8.147. @div-type-ref
8.148. @div-type-ref
8.149. @new
8.150. @new
8.151. @old
8.152. @old
8.153. @pos
8.154. @ref
8.155. @src
8.156. @val
8.157. <anchor-div-ref>
8.158. <div-ref>
8.159. <div-type-ref>
8.160. <equate-div-types>
8.161. <equate-works>
8.162. <realign>
8.163. <split-leaf-div-at>
8.164. <split-leaf-div-at>
8.165. <TAN-A-div>
8.166. <TAN-A-div>
8.167. <TAN-A-div>
8.168. @seg
8.169. @work
8.170. <align>
8.171. <bitext-relation>
8.172. <bitext-relation>
8.173. <bitext-relation>
8.174. <reuse-type>
8.175. <reuse-type>
8.176. <reuse-type>
8.177. <TAN-A-tok>
8.178. <TAN-A-tok>
8.179. <TAN-A-tok>
8.180. @bitext-relation
8.181. @bitext-relation
8.182. @bitext-relation
8.183. @reuse-type
8.184. @reuse-type
8.185. @reuse-type
8.186. <ana>
8.187. <l>
8.188. <lexicon>
8.189. <lexicon>
8.190. <lm>
8.191. <morphology>
8.192. <morphology>
8.193. <TAN-LM>
8.194. <TAN-LM>
8.195. @lexicon
8.196. @lexicon
8.197. @morphology
8.198. @morphology
8.199. <item>
8.200. <TAN-key>
8.201. <TAN-key>
8.202. <TAN-key>
8.203. <TAN-key>
8.204. <feature>
8.205. <TAN-mor>
8.206. @code
8.207. <locus>
8.208. <modal>
8.209. <object>
8.210. <person>
8.211. <scriptum>
8.212. <verb>
8.213. @adverb
8.214. @claim-basis
8.215. @claimant
8.216. @object-datatype