The Text Alignment Network: Official Guidelines

Revision History
Revision 1 dev, 2017-03-14

Prepublication working draft, for circulation among those interested in helping to develop TAN. Corrections are actively solicited and should be sent to the author (see above).

Formats: HTML, PDF, Docbook (master)

Warning

This version of the guidelines is known to have significant omissions. In case of contradictions, apparent or not, between the schemas and these guidelines, the greatest weight is to be given to the RELAX-NG schemas (compact syntax), then to the functions, and then to these guidelines.

Chapters 1-7 and 10 are written by hand, and are relatively accurate. Chapters 8, 9, 11, and 12 are written by an algorithm that selectively reformats normative TAN files. Most errors or inconsistencies found in those chapters will originate in the stylesheets that produce them.

TAN is meant to encourage collaboration and distributed scholarship. If you are part of a project that would like to use and help develop the TAN format, please contact the author (see above).


Table of Contents

I. General Overview
1. Introduction
Definition and purpose
Rationale and purpose
Design Principles
2. Starting off with the TAN Format
Creating TAN Transcription and Alignment Data
The Principles of TAN Metadata (<head>)
Creating TAN Metadata (<head>)
Aligning across Projects
II. Detailed Description
3. General Underpinnings
The Big Picture
Assumptions in the Creation of TAN Data
Core Technology
Unicode
eXtensible Markup Language (XML)
Namespaces
The Text Encoding Initiative
Schemas, Validation, and Compatibility
Data types
Identifiers and Their Use
Regular Expressions
Interpretation of multiple values
4. Patterns and Structures Common to All TAN Encoding Formats
Common Patterns
IRI + name Pattern
Digital Entity Metadata Pattern
Edit Stamp
Overall Structure (root)
@id and a TAN file's IRI Name
Metadata (<head>)
Rights and Licenses
Inclusions and Keys
Distinguishing <source>s and <see-also>s
Interpretation of inheritable attributes
Defining Words and Tokens
5. Class-1 TAN Files, Representations of Textual Objects (Scripta)
Principles and Assumptions
General
Domain model
One version, one work, one object, one reference system
Normalizing transcriptions
Transcriptions
Flattened References, and the Leaf Div Uniqueness Rule
Transcriptions Using the Text Encoding Initiative (<TEI>)
6. Class-2 TAN Files, Annotations of Texts
Common Elements
Class 2 Validation
Class 2 Metadata (<head>)
Class 2 Data Patterns (<body>)
@pos and @val
Alignments: Principles and Assumptions
Division-Based Alignments (<TAN-A-div>)
Root Element and Header
Data (<body>)
Token-Based Alignments (<TAN-A-tok>)
Root Element and Header
Data (<body>)
Lexico-Morphology
Principles and Assumptions
Root Element and Header
Data (<body>)
7. Class-3 TAN Files, Varia
Keyword Vocabulary (TAN-key)
Root Element and Head
Data (<body>)
Morphological Concepts and Patterns (TAN-mor)
Principles and Assumptions
Root Element and Header
Data (<body>)
Claims and assertions (TAN-c)
Root Element and Header
Data (<body>)
8. TAN patterns, elements, and attributes defined
TAN-core elements and attributes summarized
<agent>
<agentrole>
<alias>
<body>
<change>
<checksum>
<comment>
<declarations>
<desc>
<for-lang>
<group>
<group-type>
<head>
<inclusion>
<IRI>
<key>
<location>
<master-location>
<name>
<relationship>
<rights-excluding-sources>
<rights-source-only>
<role>
<see-also>
<source>
<tail>
<token-definition>
<value>
<version>
<when>
<work>
@affects-element
@cert
@cert2
@ed-when
@ed-who
@flags
@from
@group
@help
@href
@id
@idrefs
@in-progress
@include
@n
@regex
@rights-holder
@roles
@TAN-version
@to
@type
@when
@when-accessed
@which
@who
@xml:id
@xml:lang
TAN-class-1 elements and attributes summarized
<div-type>
<filter>
<normalization>
<replace>
<transliteration>
@replacement
TAN-T elements and attributes summarized
<div>
<TAN-T>
TAN-class-2 elements and attributes summarized
<rename>
<rename-div-ns>
<suppress-div-types>
<tok>
@chars
@cont
@div-type-ref
@new
@old
@pos
@ref
@src
@val
TAN-A-div elements and attributes summarized
<anchor-div-ref>
<div-ref>
<div-type-ref>
<equate-div-types>
<equate-works>
<realign>
<split-leaf-div-at>
<TAN-A-div>
@seg
@work
TAN-A-tok elements and attributes summarized
<align>
<bitext-relation>
<reuse-type>
<TAN-A-tok>
@bitext-relation
@reuse-type
TAN-LM-core elements and attributes summarized
<ana>
<l>
<lexicon>
<lm>
<m>
<morphology>
<TAN-LM>
@def-ref
@lexicon
@morphology
TAN-LM elements and attributes summarized
TAN-LM-lang elements and attributes summarized
TAN-class-3 elements and attributes summarized
TAN-key elements and attributes summarized
<item>
<TAN-key>
TAN-mor elements and attributes summarized
<assert>
<category>
<feature>
<report>
<TAN-mor>
@code
@context
@feature-qty-test
@feature-test
@matches-m
@matches-tok
TAN-c elements and attributes summarized
<TAN-c>
TAN-c-core elements and attributes summarized
<claim>
<claim-basis>
<locus>
<modal>
<object>
<person>
<place>
<scriptum>
<subject>
<topic>
<unit>
<verb>
@adverb
@claim-basis
@claimant
@object
@object-datatype
@object-lexical-constraint
@subject
@units
@verb
@where
TAN patterns
~agent-list
~agent-ref
~agent-role-list
~alignment
~alignment-attributes-non-class-2
~alignment-content-non-class-2
~alignment-inclusion-opt
~anchor-div-ref-item
~any-attribute
~any-content
~any-element
~assert
~attr-cert
~attr-cert2
~bitext-relation-attr
~body-group
~body-group-opt
~category
~category-feature
~category-list
~cert-claim
~cert-content
~cert-opt
~certainty-stamp
~change-list
~char-ref
~checksum
~claim
~claim-div-ref-item
~claimant
~code
~comment
~complex-object
~complex-rationale
~complex-subject
~complex-text-ref
~complex-textual-reference-set
~continuation
~continuation-opt
~decl-alias
~decl-brel
~decl-class-1
~decl-div
~decl-filt
~decl-filt-norm
~decl-filt-repl
~decl-filt-tlit
~decl-filter-content
~decl-group-type
~decl-id-ref-opt
~decl-lexi
~decl-mode
~decl-morph
~decl-non-class-1
~decl-opt
~decl-pattern-default
~decl-pattern-language
~decl-pattern-no-id
~decl-pers
~decl-place
~decl-rename-div-n
~decl-reus
~decl-scri
~decl-supp-div-type
~decl-tok-def
~decl-topic
~decl-unit
~decl-verb
~decl-vers
~decl-work
~declaration-core
~declaration-items
~div-item-ref
~div-range-ref
~div-type-equiv
~div-type-ref
~div-type-ref-cluster
~ed-agent
~ed-stamp
~ed-time
~element-scope
~entity-digital-generic-ref
~entity-digital-tan-other-ref
~entity-digital-tan-self-ref
~entity-nondigital-ref
~entity-tok-def
~error-flag
~feature
~feature-list
~feature-pattern
~feature-pattern-no-code
~feature-qty-test
~feature-test
~filter
~func-param-flags
~func-param-pattern
~func-replace
~grammar-attr
~group-attributes
~group-ref
~help-opt
~href-opt
~id-option
~inclusion
~inclusion-att
~inclusion-item
~inclusion-list
~internal-id
~internal-idrefs
~IRI-gen
~IRI-gen-ref
~item
~item-picker
~item-pos-ref
~key-item
~key-list
~keyword-ref
~lang-of-content
~lang-outside
~lexeme
~lexicon-attr
~loc-self
~loc-src
~locus
~matches-m
~matches-tok
~metadata-desc
~metadata-human
~modal-claim
~morph
~n
~n-val
~name-change
~non-class-2-opt
~nonsource-rights
~nontextual-reference
~object
~object-constraint
~object-datatype
~object-element
~object-lexical-constraint
~other-body-attributes
~period-filter
~place-filter
~pointer-to-div-item
~pointer-to-div-range
~progress
~rationale
~realignment
~reanchor-div-ref-item
~relationship
~report
~reuse-type-attr
~rights-holder
~role-list
~role-ref
~see-also-item
~see-also-list
~seg-ref
~seq-picker
~seq-pos-ref
~set-of-claims
~simple-object
~simple-rationale
~simple-subject
~simple-textual-reference
~source-id-opt
~source-item
~source-list
~source-ref
~source-refs
~source-rights
~split
~subject
~TAN-body
~TAN-body-core
~TAN-c-decl
~TAN-c-decl-core
~TAN-c-item
~TAN-head
~TAN-key-decl
~TAN-key-item
~TAN-LM-item
~TAN-R-mor-body
~TAN-root
~TAN-tail
~TAN-ver
~test-pattern
~text-div
~textual-reference
~tok-attr-core
~tok-cert-opt
~tok-regular
~tok-sequence
~tok-sequence-attr-core
~tok-source-ref-opt
~tok-with-cont-but-no-src
~tok-with-src-and-cont
~tok-without-cont-or-src
~token-value-ref
~type
~units
~URI-tag
~verb
~when-claim
~work-equiv
~work-ref
~work-refs
9. Official TAN keywords
TAN keywords for types of bitext relations (<bitext-relation>)
TAN keywords for types of divisions (<div-type>)
TAN keywords for features (<feature>)
TAN keywords for types of groups (<group-type>)
TAN keywords for types of modals (<modal>)
TAN keywords for types of normalizations (<normalization>)
TAN keywords for types of relationships (<relationship>)
TAN keywords for types of bitext reuse (<reuse-type>)
TAN keywords for types of rights (<rights-excluding-sources>, <rights-source-only>)
TAN keywords for types of roles (<role>)
TAN keywords for types of token definitions (<token-definition>)
TAN keywords for verbs (<verb>)
III. Working with the Text Alignment Network
10. Best Practices in Working with TAN Files
File Setup
Creating and Editing TAN Files
Sharing TAN files
Doing Things with TAN Files (Stylesheets and the Function Library)
11. TAN variables, keys, functions, and templates
TAN-core global variables, keys, and functions summarized
variables
keys
functions
TAN-core-errors global variables, keys, and functions summarized
variables
functions
TAN-class-1 global variables, keys, and functions summarized
variables
functions
TAN-class-2 global variables, keys, and functions summarized
variables
keys
functions
templates
TAN-A-div global variables, keys, and functions summarized
variables
functions
TAN-A-tok global variables, keys, and functions summarized
functions
TAN-LM global variables, keys, and functions summarized
variables
functions
TAN-class-2-errors global variables, keys, and functions summarized
functions
TAN-class-1-and-2 global variables, keys, and functions summarized
variables
functions
TAN-key global variables, keys, and functions summarized
variables
TAN-class-2-and-3 global variables, keys, and functions summarized
functions
diff-for-xslt2 global variables, keys, and functions summarized
functions
TAN-schema global variables, keys, and functions summarized
variables
functions
Mode templates
ŧ #all
ŧ add-lm-to-tok
ŧ add-tok-val
ŧ analysis-stamp
ŧ analyze-ref
ŧ arabic-numerals
ŧ c1-add-ref
ŧ c1-stamp-string-length
ŧ c1-stamp-string-pos
ŧ char-setup
ŧ class-1-copy-errors
ŧ class-1-errors
ŧ class-2-errors
ŧ compare-copies
ŧ convert-code-to-features
ŧ copy-of-except
ŧ core-attribute-errors
ŧ core-errors
ŧ count-tokenized-class-1
ŧ count-tokens
ŧ cull-prepped-class-1
ŧ diff-rectify
ŧ drop-tokenization
ŧ expand-lm
ŧ first-stamp
ŧ get-div-hierarchy-fragment
ŧ get-mismatched-text
ŧ include
ŧ infuse-tokenized-div
ŧ infuse-tokenized-text
ŧ insert-seg-into-leaf-divs-in-hierarchy-fragment
ŧ mark-splits
ŧ mark-splits-in-fragment
ŧ mark-tok-chars
ŧ normalize-space
ŧ pick-prepped-class-1
ŧ pluck
ŧ prep-class-1
ŧ prep-class-2-doc-pass-1
ŧ prep-class-2-doc-pass-2
ŧ prep-class-2-doc-pass-3
ŧ prep-class-2-doc-pass-3-old
ŧ prep-class-2-doc-pass-4
ŧ prep-rim-pass-1
ŧ prep-rim-pass-2
ŧ prep-srcs-verbosely
ŧ prep-tan-a-div-pass-3-prelim
ŧ prep-tan-a-div-pass-a
ŧ prep-tan-a-div-pass-b
ŧ prep-tan-claims
ŧ prep-tan-key
ŧ prep-tan-lm
ŧ prep-tan-mor
ŧ prep-verbosely
ŧ prepare-class-1-doc-for-merge
ŧ prepend-id-or-idrefs
ŧ process-splits
ŧ realign-tan-a-div-sources
ŧ referenced-doc-errors
ŧ resolve-attr-include
ŧ resolve-href
ŧ resolve-keyword
ŧ segment-tokd-prepped-class-1
ŧ split-marked-fragment
ŧ stamp-element-id
ŧ strip-all-attributes-except
ŧ strip-duplicates
ŧ strip-specific-attributes
ŧ strip-text
ŧ synthesize-merged-sources
ŧ TAN-A-div-errors
ŧ tan-a-div-merge-pass1
ŧ tan-key-errors
ŧ tokenize-prepped-class-1
ŧ unconsolidate-anas
Cross-format global variables
$self-and-sources-prepped
$self-prepped
$sources-prepped
Cross-format functions
tan:prep-resolved-class-1-doc()
12. Errors
error[adv01]
error[adv02]
error[adv03]
error[ali01]
error[cl101]
error[cl102]
error[cl103]
error[cl104]
error[cl105]
error[cl106]
warning[cl107]
error[cl108]
error[cl109]
error[cl110]
error[cl111]
error[cl112]
error[cl113]
error[cl114]
fatal[cl201]
error[cl202]
error[cl203]
error[cl204]
error[cl208]
error[cl209]
warning[cl210]
error[cl211]
error[cl212]
error[clm01]
error[clm02]
error[clm03]
error[clm04]
error[clm05]
error[clm06]
error[clm07]
error[dst01]
error[dty01]
error[equ01]
error[inc01]
error[inc02]
error[inc03]
fatal[inc04]
error[inc05]
error[loc01]
error[loc02]
error[loc03]
error[rea01]
error[ref01]
error[ref02]
warning[ref03]
error[ref04]
error[see01]
error[see02]
error[see03]
error[see04]
error[seg01]
error[seq01]
error[seq02]
error[seq03]
error[spl01]
error[spl02]
error[spl03]
error[tan01]
error[tan02]
error[tan03]
error[tan04]
error[tan05]
error[tan06]
error[tan07]
error[tan08]
error[tan09]
error[tan10]
error[tan11]
error[tan12]
error[tan13]
error[tan14]
error[tan15]
error[tei01]
error[tky01]
error[tky02]
error[tky03]
error[tky04]
error[tlm01]
error[tlm02]
error[tlm03]
error[tlm04]
warning[tlm05]
error[tmo01]
error[tmo02]
error[tmo03]
error[tok01]
error[tok02]
error[whe01]
error[whe02]
error[whe03]
error[whi01]
error[whi02]
error[whi03]
fatal[whi04]
warning[wrn01]
warning[wrn02]
warning[wrn03]
warning[wrn04]
warning[wrn05]

List of Figures

3.1. Venn diagram

List of Tables

2.1. Ring around the Rosie
3.1. Unicode characters
3.2. Locations of master schemas
3.3. Special characters in regular expressions
3.4. Examples of Regular Expressions
4.1. Root TAN elements
5.1. Synopsis of TAN-TEI customization
9.1. TAN keywords for types of bitext relations
9.2. TAN keywords for types of divisions
9.3. TAN keywords for features
9.4. TAN keywords for types of groups
9.5. TAN keywords for types of modals
9.6. TAN keywords for types of normalizations
9.7. TAN keywords for types of relationships
9.8. TAN keywords for types of bitext reuse
9.9. TAN keywords for types of rights
9.10. TAN keywords for types of roles
9.11. TAN keywords for types of token definitions
9.12. TAN keywords for verbs
10.1. Global variables for referred files

List of Examples

3.1. TAN IRI names
8.1. <agent>
8.2. <agent>
8.3. <agent>
8.4. <agent>
8.5. <alias>
8.6. <body>
8.7. <body>
8.8. <body>
8.9. <body>
8.10. <change>
8.11. <change>
8.12. <change>
8.13. <checksum>
8.14. <comment>
8.15. <comment>
8.16. <comment>
8.17. <comment>
8.18. <declarations>
8.19. <declarations>
8.20. <declarations>
8.21. <declarations>
8.22. <desc>
8.23. <desc>
8.24. <desc>
8.25. <for-lang>
8.26. <group>
8.27. <group-type>
8.28. <group-type>
8.29. <group-type>
8.30. <head>
8.31. <head>
8.32. <head>
8.33. <head>
8.34. <inclusion>
8.35. <IRI>
8.36. <key>
8.37. <location>
8.38. <location>
8.39. <master-location>
8.40. <master-location>
8.41. <master-location>
8.42. <master-location>
8.43. <name>
8.44. <relationship>
8.45. <relationship>
8.46. <rights-excluding-sources>
8.47. <rights-excluding-sources>
8.48. <rights-excluding-sources>
8.49. <rights-excluding-sources>
8.50. <role>
8.51. <role>
8.52. <role>
8.53. <role>
8.54. <see-also>
8.55. <see-also>
8.56. <source>
8.57. <source>
8.58. <source>
8.59. <source>
8.60. <token-definition>
8.61. <token-definition>
8.62. <token-definition>
8.63. <token-definition>
8.64. <value>
8.65. <version>
8.66. <work>
8.67. <work>
8.68. <work>
8.69. <work>
8.70. @affects-element
8.71. @affects-element
8.72. @affects-element
8.73. @affects-element
8.74. @cert
8.75. @cert
8.76. @ed-when
8.77. @ed-when
8.78. @ed-when
8.79. @ed-who
8.80. @ed-who
8.81. @ed-who
8.82. @group
8.83. @href
8.84. @href
8.85. @id
8.86. @id
8.87. @id
8.88. @id
8.89. @idrefs
8.90. @in-progress
8.91. @in-progress
8.92. @in-progress
8.93. @in-progress
8.94. @include
8.95. @n
8.96. @regex
8.97. @regex
8.98. @regex
8.99. @regex
8.100. @rights-holder
8.101. @rights-holder
8.102. @rights-holder
8.103. @rights-holder
8.104. @roles
8.105. @roles
8.106. @roles
8.107. @roles
8.108. @TAN-version
8.109. @TAN-version
8.110. @TAN-version
8.111. @TAN-version
8.112. @type
8.113. @when
8.114. @when
8.115. @when
8.116. @when-accessed
8.117. @when-accessed
8.118. @which
8.119. @which
8.120. @who
8.121. @who
8.122. @who
8.123. @xml:id
8.124. @xml:lang
8.125. @xml:lang
8.126. <div-type>
8.127. <div-type>
8.128. <filter>
8.129. <filter>
8.130. <filter>
8.131. <filter>
8.132. <normalization>
8.133. <normalization>
8.134. <normalization>
8.135. <normalization>
8.136. <div>
8.137. <TAN-T>
8.138. <TAN-T>
8.139. <TAN-T>
8.140. <TAN-T>
8.141. <rename>
8.142. <rename>
8.143. <rename-div-ns>
8.144. <rename-div-ns>
8.145. <suppress-div-types>
8.146. <suppress-div-types>
8.147. <tok>
8.148. @chars
8.149. @div-type-ref
8.150. @div-type-ref
8.151. @new
8.152. @new
8.153. @old
8.154. @old
8.155. @pos
8.156. @ref
8.157. @src
8.158. @val
8.159. <anchor-div-ref>
8.160. <div-ref>
8.161. <div-type-ref>
8.162. <equate-div-types>
8.163. <equate-works>
8.164. <realign>
8.165. <split-leaf-div-at>
8.166. <split-leaf-div-at>
8.167. <TAN-A-div>
8.168. <TAN-A-div>
8.169. <TAN-A-div>
8.170. @seg
8.171. @work
8.172. <align>
8.173. <bitext-relation>
8.174. <bitext-relation>
8.175. <bitext-relation>
8.176. <reuse-type>
8.177. <reuse-type>
8.178. <reuse-type>
8.179. <TAN-A-tok>
8.180. <TAN-A-tok>
8.181. <TAN-A-tok>
8.182. @bitext-relation
8.183. @bitext-relation
8.184. @bitext-relation
8.185. @reuse-type
8.186. @reuse-type
8.187. @reuse-type
8.188. <ana>
8.189. <l>
8.190. <lexicon>
8.191. <lexicon>
8.192. <lm>
8.193. <morphology>
8.194. <morphology>
8.195. <morphology>
8.196. <TAN-LM>
8.197. <TAN-LM>
8.198. <TAN-LM>
8.199. @lexicon
8.200. @lexicon
8.201. @lexicon
8.202. @morphology
8.203. @morphology
8.204. @morphology
8.205. <item>
8.206. <TAN-key>
8.207. <TAN-key>
8.208. <TAN-key>
8.209. <TAN-key>
8.210. <feature>
8.211. <TAN-mor>
8.212. @code
8.213. <locus>
8.214. <modal>
8.215. <object>
8.216. <person>
8.217. <scriptum>
8.218. <verb>
8.219. @adverb
8.220. @claim-basis
8.221. @claimant
8.222. @object-datatype