Core Technology

TAN depends upon a set of relatively stable technologies. Those technologies and the underlying terminology are very briefly defined and explained below, with particular attention to interpretive decisions that have been adopted by TAN validation rules. References to further reading will lead you to better and more thorough introductions.

Unicode is the worldwide standard for the consistent encoding, representation, and exchange of digital texts. Stable but still growing, Unicode is intended to represent all the world's writing systems, living and historical. Maintained by a nonprofit organization, the Unicode standard allows us to encode texts in any writing system and to exchange them reliably with other people, independent of individual fonts.

With more than 128,000 characters, Unicode is almost as complex as human writing itself. The entire sequence of characters is divided into blocks, each one reserved, more or less, for a particular alphabet or a set of characters that share something in common. Within each block, characters may be grouped further. Each character is assigned a single codepoint.

Because computers work on the binary system, codepoints have been numbered according to the related hexadecimal system (base 16), which uses the digits 0 through 9 and the letters A through F. (The number 10 in decimal is A in hexadecimal; decimal 11 = hex B; decimal 16 = hex 10; decimal 79 = hex 4F.) It is helpful to think of Unicode as a very long ribbon sixteen squares wide, a glyph in each square. This is illustrated nicely in this article. Each position along the width is labeled with a hexadecimal number (0-9, A-F) that always identifies the last digit of a character's codepoint value.
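
The correspondence between decimal and hexadecimal numbers is easy to check for yourself. The following is a minimal Python sketch, not part of TAN; the function name ribbon_position is hypothetical. It prints a few conversions and illustrates the ribbon metaphor: the last hexadecimal digit of a codepoint gives its column on the ribbon, and the remaining digits give its row.

```python
def ribbon_position(codepoint: int) -> tuple[int, int]:
    """Return (row, column) of a codepoint on a ribbon sixteen squares wide."""
    return divmod(codepoint, 16)


# Decimal numbers and their hexadecimal equivalents.
for decimal in (10, 11, 16, 79):
    print(decimal, "->", format(decimal, "X"))

# U+03B1 GREEK SMALL LETTER ALPHA sits in row 3B, column 1.
row, column = ribbon_position(0x03B1)
print(format(row, "X"), format(column, "X"))
```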

It is common to refer to Unicode characters by their value or their name. The value customarily starts "U+" and continues with the hexadecimal value, usually at least four digits. The official Unicode name is usually given fully in uppercase. Examples:
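
The convention can also be reproduced programmatically. The sketch below uses Python's standard unicodedata module to look up official names; the helper name describe is hypothetical, and the four characters were chosen only for illustration.

```python
import unicodedata


def describe(char: str) -> str:
    """Format one character as 'U+XXXX OFFICIAL NAME'."""
    # At least four hexadecimal digits, padded with zeros on the left.
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


for char in ("a", "\u03b1", "\u200d", "\u00ad"):
    print(describe(char))
```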


When the characters U+200D ZERO WIDTH JOINER and U+00AD SOFT HYPHEN occur at the end of a leaf <div>, perhaps followed by white space that will be ignored (see below), processors will assume that the character is to be deleted and that, when the <div> is combined with the next leaf <div>, no intervening space is to be inserted. Furthermore, because these characters are difficult to discern from spaces and hyphens, any output based on the character mapping of the core functions should replace these characters with their numeric character references, &#x200d; and &#xad;.
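
The replacement described in the last sentence might be sketched as follows. This is an illustration in Python, not the TAN function library; the names SPECIAL and make_visible are hypothetical.

```python
# Characters that are hard to see, mapped to the visible forms given above.
SPECIAL = {
    "\u200d": "&#x200d;",  # ZERO WIDTH JOINER
    "\u00ad": "&#xad;",    # SOFT HYPHEN
}


def make_visible(text: str) -> str:
    """Replace each special character with its visible, escaped form."""
    for char, reference in SPECIAL.items():
        text = text.replace(char, reference)
    return text


print(make_visible("exam\u00ad"))
```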

Validation files are found here: http://textalign.net/release/TAN-2018/schemas/.

Each TAN file is validated by two types of schema files: one deals with major rules concerning structure and data type (written in RELAX-NG), the other with very detailed rules (written in Schematron).

The RELAX-NG rules are written primarily in compact syntax (.rnc), and then converted to the XML syntax (.rng). For TAN-TEI, the special format One Document Does it All (.odd) is used to alter the rules for TEI All.

The Schematron files are generally quite short. The primary work is done by a large function library written in XSLT. For more on this process, see the section called “Doing things with TAN files”.

Some validation engines that process a valid TAN-compliant TEI file may return an error something like this: conflicting ID-types for attribute "who" of element "comment" from namespace "tag:textalign.net,2015:ns". Such a message alerts you to the fact that by mixing TEI and TAN namespaces, you open yourself up to the possibility of conflicting xml:id values. It is your responsibility to ensure that you have not assigned duplicate identifiers. Very often, you can configure an XML editor to ignore this discrepancy. (In oXygen XML editor go to Options > Preferences... > XML > XML Parser > RELAX NG and uncheck the box ID/IDREF.)

By default in XML, unless otherwise specified, consecutive space characters (space, tab, newline, and carriage return) are considered equivalent to a single space. This gives editors the freedom they need to format XML documents as they like, for either human readability or compactness.

All TAN formats assume space normalization, with an extra caveat, namely, that some space is assumed to exist between adjacent leaf <div>s, even if no text node intervenes. This behavior is overridden if the first leaf <div> ends in the soft hyphen or the zero width joiner (see the section called “Unicode characters with special interpretation”).
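
These rules can be sketched in a few lines of Python. The sketch is an illustration only, not TAN's own implementation; the function names are hypothetical, and a single space is assumed as the separator between leaf <div>s. A joiner at the end of the final <div> is left untouched here, because there is no following <div> to combine it with.

```python
import re

JOINERS = ("\u200d", "\u00ad")  # ZERO WIDTH JOINER, SOFT HYPHEN


def normalize_space(text: str) -> str:
    """Collapse runs of space, tab, newline, carriage return; trim both ends."""
    return re.sub(r"[ \t\n\r]+", " ", text).strip(" ")


def join_leaf_divs(div_texts: list[str]) -> str:
    """Concatenate the text of adjacent leaf divs according to the rules above."""
    result = ""
    for raw in div_texts:
        text = normalize_space(raw)
        if result.endswith(JOINERS):
            # Delete the joiner and allow no intervening space.
            result = result[:-1] + text
        elif result and text:
            # Otherwise some space is assumed between adjacent leaf divs.
            result = result + " " + text
        else:
            result += text
    return result


print(join_leaf_divs(["one  \n two", "three"]))
print(join_leaf_divs(["exam\u00ad \n", "ple text"]))
```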

The TAN format does not stipulate how space-only text nodes should be interpreted. It is up to processors to analyze the relevant <div-type> to infer an appropriate type of white-space separator.

If retention of multiple spaces is important for your research, then TAN formats may not be appropriate, since TAN is not intended to replicate the appearance of a scriptum. Pure TEI (and not TAN-TEI) might be a practical alternative, since it allows for a literal use of space, and encourages XML files that try to replicate the appearance of a scriptum.

For more on white space see the W3C recommendation.

XML allows users to develop vocabularies of elements as they wish. One person may wish to use the element <bank> to refer to financial institutions, another to rivers. Perhaps someone wishes to mention both rivers and financial institutions in the same document. XML was designed to allow users to mix vocabularies, even when those vocabularies use identical element names. This means that anyone using <bank> must be allowed to specify exactly which vocabulary is being used. Disambiguation is accomplished by associating IRIs (see the section called “Identifiers and Their Use” below) with the element names. The actual full name of an element is the local name plus the IRI that qualifies its meaning, e.g., bank{http://example1.com/terms/} and bank{http://example2.com/terms/}.

The relationship between the element name and the IRI is analogous to that between a person's given name and their family name. The IRI—the family name—is called the namespace—not an ideal term, but the one that has been adopted. Think of the namespace as the family name for a group of elements.

Namespace declarations look a lot like attributes (they aren't). They take the form xmlns="http://example1.com/terms/" (defining the default namespace) or xmlns:[PREFIX]="http://example2.com/terms/" (defining a namespace that has been assigned a particular prefix) and are placed inside an opening tag. For example, <bank xmlns="http://example1.com/terms/">...</bank> declares, in effect, both the namespace of <bank> itself and the default namespace of all its descendants, which any descendant can explicitly override.

Different types of <bank> can be mixed through namespaces:

<bank xmlns="http://example1.com/terms/">
    <bank xmlns="http://example2.com/terms/">
        ...
    </bank>
</bank>

<bank xmlns="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">
    <e2:bank>
        ...
    </e2:bank>
</bank>

<e1:bank xmlns:e1="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">
    <e2:bank>
        ...
    </e2:bank>
</e1:bank>
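
That the three fragments above are equivalent can be checked with any namespace-aware parser. The sketch below uses Python's standard xml.etree.ElementTree, which reports each element's full name with the IRI in braces before the local name (the reverse of the order used in the prose above). The names DOCS and expanded_names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The three fragments above, written on single lines.
DOCS = [
    '<bank xmlns="http://example1.com/terms/">'
    '<bank xmlns="http://example2.com/terms/">...</bank></bank>',
    '<bank xmlns="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">'
    '<e2:bank>...</e2:bank></bank>',
    '<e1:bank xmlns:e1="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">'
    '<e2:bank>...</e2:bank></e1:bank>',
]


def expanded_names(xml_text: str) -> list[str]:
    """Return the namespace-qualified name of every element, in document order."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


# Each fragment yields the same pair of qualified names.
for doc in DOCS:
    print(expanded_names(doc))
```

Prefixes and default-namespace declarations are only notational devices; what a processor sees is the pairing of IRI and local name.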

The Text Encoding Initiative (TEI) is a collection of XML rules for the representation of texts in digital form. Developed and maintained by a consortium of scholars and scholarly organizations, TEI includes not only a library of schemas, but guidelines, stylesheets, and more. The TEI Guidelines have been widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation. In addition to the Guidelines themselves, the Consortium provides a variety of resources and training events for learning TEI, information on projects using the TEI, a bibliography of TEI-related publications, and software.

Note

Taken from the TEI website http://www.tei-c.org/index.xml, accessed 2017-05-21.

Any TAN-T module can be easily cast into a TEI file, although much of the computer-actionable semantics will be lost in the process. Likewise, a TEI file can be converted to TAN-T, but there is a greater risk of loss of content, particularly in the header, since the non-TEI TAN formats are restricted to a small subset of TEI tags.

TAN-TEI is TAN's TEI extension, based on an ODD file that is in the same directory as the rest of the schemas. TAN-TEI schemas are generated on the basis of the official TEI All schema that is available at the time of release.

For more about the strictures placed upon the TEI All schema see the section called “Transcriptions Using the Text Encoding Initiative (<TEI>)”. See also Chapter 4, Patterns and Structures Common to All TAN Encoding Formats and Chapter 5, Class-1 TAN Files, Representations of Textual Objects (Scripta).

Being written purely in XML technologies, TAN adopts its data types, e.g., strings, booleans, and so forth, from the official specifications made by the W3C. The following data types require some special comments.

The acronyms for identifiers, and the meanings of those acronyms, can be mystifying. Here is a synopsis:

  • IRI: Internationalized Resource Identifier, a generalization of the URI system, allowing the use of Unicode; defined by RFC 3987

  • URI: Uniform Resource Identifier, a string of characters used to identify a name or a resource; defined by RFC 3986

  • URL: Uniform Resource Locator, a URI that identifies a Web resource and the communication protocol for retrieving the resource.

  • URN: Uniform Resource Name, a term that originally referred to persistent names using the urn: scheme, but is now applied to a variety of systems that have registered with the IANA. URNs are generally best thought of as a subset of URIs.

  • UUID: Universally Unique Identifier, a computer-generated 128-bit number used to assign identifiers to any entity. UUIDs can be built into a URN by prefixing them with urn:uuid:.

The TAN format generally prefers to refer to IRIs.
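
As a brief illustration of how these identifier types relate, the following Python sketch generates a UUID, expresses it as a URN, and reads off the scheme (the part before the first colon) of three identifiers. It relies only on the standard uuid and urllib.parse modules; the choice of sample identifiers is for illustration only.

```python
import uuid
from urllib.parse import urlsplit

# A random (version 4) UUID; the value differs on every run.
identifier = uuid.uuid4()
print(identifier.urn)  # begins with "urn:uuid:"

# The scheme is whatever precedes the first colon of a URI or IRI.
for iri in (
    "http://textalign.net/release/TAN-2018/schemas/",
    "tag:textalign.net,2015:ns",
    identifier.urn,
):
    print(urlsplit(iri).scheme, "<-", iri)
```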

See also the section called “Tag URNs”.

Much of TAN can be converted to RDF statements. In fact, TAN may be one of the most human-friendly ways to read and write RDF. Compare, for example, this snippet (taken from http://linkeddatabook.com/editions/1.0/), written in Turtle syntax, ...

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://biglynx.co.uk/people/dave-smith>
    rdf:type foaf:Person ;
    foaf:name "Dave Smith" .

...with the TAN equivalent:

<person xml:id="dsmith">
   <IRI>http://biglynx.co.uk/people/dave-smith</IRI>
   <name>Dave Smith</name>
</person>

In this case TAN and RDF are converted losslessly. But in many other cases, TAN statements cannot be reduced to the RDF model. This happens most often in the context of <claim>, which is designed to allow scholarly assertions and claims that are difficult or impossible to express in RDF. For example, RDF does not allow one to say "Person X is not the author of text Y."
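
To make the lossless case concrete, here is a minimal Python sketch that turns the <person> snippet above into the Turtle statements shown earlier. It is an illustration under stated assumptions, not TAN's own conversion: the function name person_to_turtle is hypothetical, the element is read without a namespace (as in the snippet), and the mapping of <person> to foaf:Person and <name> to foaf:name is simply copied from the comparison above.

```python
import xml.etree.ElementTree as ET

TAN_PERSON = """
<person xml:id="dsmith">
   <IRI>http://biglynx.co.uk/people/dave-smith</IRI>
   <name>Dave Smith</name>
</person>
"""


def person_to_turtle(xml_text: str) -> str:
    """Render a <person> element as Turtle (assumed foaf mapping)."""
    person = ET.fromstring(xml_text.strip())
    iri = person.findtext("IRI").strip()
    name = person.findtext("name").strip()
    # Escape backslashes and double quotes for a Turtle string literal.
    literal = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n"
        "@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n"
        "\n"
        f"<{iri}>\n"
        "    rdf:type foaf:Person ;\n"
        f'    foaf:name "{literal}" .\n'
    )


print(person_to_turtle(TAN_PERSON))
```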

TAN claims have adapted the core concepts behind RDF to cater to scholarly needs. For more details see the section called “Division-Based Annotations and Alignments (<TAN-A-div>)”.

TAN files make extensive use of tag URNs (see the section called “Identifiers and Their Use”). In fact, TAN's namespace is a tag URN (the section called “Namespaces”). A tag URN has two parts:

  1. Namespace. tag: + an e-mail address or domain name owned by the person or organization that has authorized the creation of the TAN file + , + an arbitrary day on which that address or domain name was owned. The day is expressed in the form YYYY-MM-DD, YYYY-MM, or YYYY. A missing MM or DD is implicitly assigned the value of 01.

  2. Name of the TAN file. : + an arbitrary string (unique to the namespace chosen) chosen by the namespace owner as a label for the entire file and related versions. It need not be the same as the filename stored on a local directory. You should pick a name that is at least somewhat intelligible to human readers.

Although you may use any tag URN coined by someone else, you may create a tag URN only if you are the owner of that URN's namespace.
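
The two parts can be assembled mechanically. The following Python sketch is an illustration only; the function names implied_date and make_tag_urn are hypothetical. It builds a tag URN from its components and shows how a missing month or day is implicitly read as 01.

```python
import re
from datetime import date

# YYYY, YYYY-MM, or YYYY-MM-DD
TAG_DATE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def implied_date(tag_date: str) -> date:
    """Return the full date implied by a tag URN date."""
    match = TAG_DATE.fullmatch(tag_date)
    if match is None:
        raise ValueError(f"not a valid tag URN date: {tag_date!r}")
    year, month, day = match.groups()
    # A missing MM or DD is implicitly 01.
    return date(int(year), int(month or 1), int(day or 1))


def make_tag_urn(authority: str, tag_date: str, name: str) -> str:
    """Combine namespace (authority and date) and name into a tag URN."""
    implied_date(tag_date)  # raises ValueError if the date is malformed
    return f"tag:{authority},{tag_date}:{name}"


print(make_tag_urn("textalign.net", "2015", "ns"))
print(implied_date("2015"), implied_date("2015-03"))
```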

Great care must be taken in choosing the IRI name, because you are the sole guarantor of its uniqueness. It is permissible for something to have multiple IRIs, but never acceptable for an IRI to name more than one thing. It is a good practice to keep a master checklist of IRI names you have created. If you find yourself forgetting, or think you run the risk of creating duplicate IRI names, you should start afresh by creating a new namespace for your tag URNs, easily done just by changing the date in the tag URN namespace.


The TAN encoding format has chosen tag URNs over URLs for several reasons:

Further reading:

  • RFC 4151, the official definition of tag URNs

Regular expressions are patterns for searching text. The term regular here does not mean ordinary. Rather, it derives from Latin regula, and points to a rule-based syntax that provides patterns for finding and replacing text. Regular expressions come in different flavors, and have several layers of complexity. TAN regular expressions adhere closely to the recommendation of XSLT 3.0 (XML Schema Datatypes plus some extensions), as outlined in XPath Functions 3.0.

Caution

XML Schema Datatypes define regular expressions differently than does Perl, one of the most common flavors of regular expression. For example, the pipe symbol, |, is treated as a word character in XML regular expressions (\w), but the opposite is true for Perl. For convenience, here is how codepoints U+0020..U+00FF are categorized according to XML (and therefore TAN):

Word characters (\w): $ + 0 1 2 3 4 5 6 7 8 9 < = > A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ^ ` a b c d e f g h i j k l m n o p q r s t u v w x y z | ~ ¢ £ ¤ ¥ ¦ ¨ © ª ¬ ® ¯ ° ± ² ³ ´ µ ¸ ¹ º ¼ ½ ¾ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Non-word characters (\W): ! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { } ¡ § « ­ ¶ · » ¿

Some of these choices may seem counterintuitive or wrong. But at this point it does not matter. The distinction is a legacy that will remain in place. It is advisable to familiarize yourself with decisions that, in some respect, are arbitrary.
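
The two lists follow from the way XML Schema Datatypes define \w: every character except those in the Unicode general categories Punctuation (P), Separator (Z), and Other (C). The Python sketch below reproduces that rule with the standard unicodedata module and contrasts it with Python's own re module, whose \w behaves like Perl's for the characters tested. The function name is_xml_word_char is hypothetical, and results for individual characters can vary with the Unicode version your software supports.

```python
import re
import unicodedata


def is_xml_word_char(char: str) -> bool:
    """True if char is outside the Punctuation, Separator, and Other categories."""
    return unicodedata.category(char)[0] not in "PZC"


for char in ("a", "|", "_", "-"):
    xml_word = is_xml_word_char(char)
    perl_like = re.fullmatch(r"\w", char) is not None
    print(repr(char), "XML \\w:", xml_word, "| Python re \\w:", perl_like)
```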

A regular expression search pattern is treated just like a conventional search pattern until the computer reaches one of the special characters: . [ ] \ | - ^ $ ? * + { } ( ). Here is a brief key to how characters behave in regular expressions, provided they are not in square brackets (on which see the recommended reading below):


Some examples:


The examples above provide a taste of how regular expressions are constructed and read.

Warning: Regular Expressions and Combining Characters

Regular expressions come in many different flavors, and each one deals with some of the more complex issues in Unicode in its own manner. This ambiguity will be felt most keenly in the use of combining characters. Suppose we have a string of three characters, áb (i.e., an acute accent over the a, &#x61;&#x301;&#x62;). The regular expression a. will in some search engines include the b and in others not.

Unicode has differentiated three levels of support for regular expressions (see the official report). TAN guarantees only level one conformance. Combining characters fall in level two. In TAN, character counts depend exclusively upon base characters, not combining ones (see the section called “Combining characters”).
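
The ambiguity is easy to observe. In the Python sketch below, the standard re module (one flavor among many) lets the dot in a. consume the combining accent rather than the b. The second half counts base characters only; it assumes that a combining character is any character in general category M (Mn, Mc, Me), which may not coincide exactly with TAN's own definition, and the function name count_base_characters is hypothetical.

```python
import re
import unicodedata

text = "a\u0301b"  # a + COMBINING ACUTE ACCENT + b

# In Python's flavor the dot matches the combining accent, so b is excluded.
match = re.search(r"a.", text)
print([f"U+{ord(char):04X}" for char in match.group()])


def count_base_characters(string: str) -> int:
    """Count characters, skipping combining marks (assumed: category M)."""
    return sum(
        1 for char in string
        if not unicodedata.category(char).startswith("M")
    )


print(len(text), count_base_characters(text))
```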

TAN includes several functions that usefully extend XML regular expressions. See tan:regex, tan:matches(), tan:replace(), tan:tokenize().

Further reading: