Core Technology

TAN depends upon a core set of relatively stable technologies. Those technologies and the underlying terminology are very briefly defined and explained below, as far as they affect the TAN format. References to further reading will lead you to better and more thorough introductions. The central goal of this section is to highlight any decisions made in the design of TAN that significantly affect how anyone might create or interpret TAN-compliant data.

Unicode is the worldwide standard for the consistent encoding, representation, and exchange of digital texts. The standard, stable but still growing, is intended to represent all the world's writing systems, living and historical. Maintained by a nonprofit organization, Unicode is the basis upon which we can create and edit text in mixed alphabets and reliably share that data with other people, independent of specific fonts. Any Unicode-compliant text is (in general) semantically interoperable on the character level and can be exchanged between users and systems, no matter what font might be used to display the text. If some software tries to display some Unicode-compliant text in a particular font that does not support a particular alphabet, and ends up displaying boxes, the underlying data is still intact and valid. Styling the text with a font that does support the alphabet will reveal this to be the case.

With more than 128,000 characters, Unicode is almost as complex as human writing itself, and so has a commensurately complicated system of organization. The entire sequence of characters is divided into blocks, each one reserved, more or less, for a particular alphabet. Within each block, characters may be grouped further. Each character is assigned a single codepoint.

Because computers work on the binary system, it was considered ideal to number the characters or glyphs in Unicode with a related numeration system. Codepoints are therefore numbered according to a hexadecimal system (base 16), which has a larger base than our most common system, the decimal (base 10). The hexadecimal system uses the digits 0 through 9 and the letters A through F. (The number 10 in decimal is A in hexadecimal; decimal 11 = hex B; decimal 16 = hex 10; decimal 79 = hex 4F.) To find Unicode codepoint values it is therefore helpful to think of the corpus of glyphs as a very long ribbon sixteen squares wide. This is illustrated nicely in this article. Each position along the width is labeled with a hexadecimal digit (0-9, A-F) that always identifies the last digit of a character's codepoint value.
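A short sketch for checking conversions between decimal and hexadecimal, using only Python built-ins. The helper name ribbon_position is hypothetical; it simply restates the ribbon metaphor, with the column given by the last hexadecimal digit and the row by everything before it.

```python
def ribbon_position(codepoint):
    """Return (row, column) of a codepoint on a ribbon sixteen squares wide.

    Hypothetical helper: the column is the last hexadecimal digit,
    the row is the number formed by the digits before it.
    """
    row, column = divmod(codepoint, 16)
    return row, format(column, "X")


# Decimal-to-hexadecimal conversions
for decimal in (10, 11, 16, 79):
    print(decimal, "=", format(decimal, "X"))

# U+03B1 (GREEK SMALL LETTER ALPHA) sits in column 1 of row 0x3B
print(ribbon_position(0x03B1))
```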

It is common to refer to Unicode characters by their value or their name. The value customarily starts "U+" and continues with the hexadecimal value, usually at least four digits. The official Unicode name is usually given fully in uppercase. Examples:
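As a minimal sketch, Python's standard unicodedata module can produce both forms for any character. The helper name describe and the sample characters are chosen here for illustration only.

```python
import unicodedata


def describe(char):
    """Format a character as 'U+XXXX OFFICIAL NAME' (hypothetical helper)."""
    return "U+{:04X} {}".format(ord(char), unicodedata.name(char))


# Sample characters picked for this sketch
for char in ("a", "\u03b1", "\u200d", "\u00ad"):
    print(describe(char))
```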

The TAN format allows the following characters anywhere, but assigns them special meaning in certain contexts:

When one of these characters occurs at the end of a leaf <div>, perhaps followed by white space that will be ignored (see below), processors will assume that the character is to be deleted and that, when the text is combined with that of the next leaf <div>, no intervening space is to be inserted. Furthermore, because these characters are difficult to distinguish from spaces and hyphens, any output based on the character mapping of the core functions will replace these characters with their XML entities, &#x200d; and &#xad;.
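The joining rule described above might be sketched as follows. This is not TAN's own implementation: the function names are hypothetical, the sketch assumes the two characters in question are U+200D (ZERO WIDTH JOINER) and U+00AD (SOFT HYPHEN), as the entities above suggest, and it assumes that leaf divs lacking such a character are joined by a single space.

```python
SPECIAL_END_CHARS = ("\u200d", "\u00ad")  # assumed: ZWJ and SOFT HYPHEN


def join_leaf_divs(first, second):
    """Join the text of two adjacent leaf divs (illustrative sketch only)."""
    trimmed = first.rstrip()  # trailing white space is ignored
    if trimmed.endswith(SPECIAL_END_CHARS):
        # Delete the special character and allow no intervening space
        return trimmed[:-1] + second.lstrip()
    # Assumption: otherwise a single space separates the two divs
    return trimmed + " " + second.lstrip()


def show_special_chars(text):
    """Replace the special characters with XML character references."""
    return text.replace("\u200d", "&#x200d;").replace("\u00ad", "&#xad;")


print(join_leaf_divs("inter\u00ad \n", "national"))
print(show_special_chars("inter\u00adnational"))
```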

Defined by the W3C, the eXtensible Markup Language (XML) is a machine-actionable markup language that facilitates human readability.

At its heart, XML is rather simple. It begins with an opening line that declares that what otherwise would look just like plain text is an XML file. It then proceeds to the data, which must be marked by one or more pairs of tags. An opening tag looks like <tag> and a closing tag like </tag> (or, if the tags contain no data, the pair can be collapsed into one: <tag/>). A pair of matching tags is called an element. Elements must nest within each other. They cannot overlap. For example:

<?xml version="1.0" encoding="UTF-8"?>
<p>A paragraph about 
   <name>
      <first>Mary</first>
      <last>Lee</last>
   </name>.
</p>

This nesting relationship of elements means that an XML document can be pictured as a tree, a metaphor that provides a host of technical names for the relationships that hold between elements: root, parent, child, sibling, ancestor, and descendant. In the example above, the root element <p> is the parent of <name> and the ancestor of <name>, <first>, and <last>. The element <first> is a child of <name> and a descendant of both <name> and <p>. <first> and <last> are siblings to each other.
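The relationships just named can be explored with a short sketch that uses Python's standard xml.etree.ElementTree module. The sample string is a one-line form of the paragraph about Mary Lee, and the variable names are chosen for illustration.

```python
import xml.etree.ElementTree as ET

doc = "<p>A paragraph about <name><first>Mary</first> <last>Lee</last></name>.</p>"
root = ET.fromstring(doc)

# ElementTree keeps no parent pointers, so build a child -> parent map
parent_of = {child: parent for parent in root.iter() for child in parent}

first = root.find("name/first")
name = parent_of[first]

print(root.tag)                              # the root element
print([child.tag for child in name])         # children of <name> (siblings)
print([el.tag for el in root.iter()][1:])    # descendants of the root
print(name.tag, parent_of[name].tag)         # ancestors of <first>
```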

The opening tag of an element might have additional nodes called attributes, recognized by a word, an equals sign, and then some text within quotation marks (single or double), e.g., id="self". An element may have many attributes, and those attributes can appear in any order. Attributes can be thought of as leaves on an XML tree. They are intended to carry simple data (usually metadata about the data contained by the element), because they cannot govern anything else.

<?xml version="1.0" encoding="UTF-8"?>
<p n="1" id="example">A paragraph about <name><first>Mary</first> <last>Lee</last></name>.</p>

The two examples above are functionally equivalent (apart from the attributes added to <p>), even though the first takes up several lines whereas the second has only two. That is because in most XML projects extra lines, spaces, and indentation are effectively ignored by processors, which gives human editors the flexibility they need to optimize indentation for readability. Therefore, any continuous string of spaces, tabs, and newlines or carriage returns is to be treated as a single space. (See below.)
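A simplified sketch of this kind of space normalization follows. It is not TAN's own algorithm: it treats the text of a whole document at once, and both sample strings are invented for the illustration.

```python
import xml.etree.ElementTree as ET

multi_line = """<p>A paragraph about
   <name><first>Mary</first>
      <last>Lee</last></name>.
</p>"""
one_line = "<p>A paragraph about <name><first>Mary</first> <last>Lee</last></name>.</p>"


def normalized_text(xml_string):
    """Concatenate all text nodes, then collapse runs of white space."""
    text = "".join(ET.fromstring(xml_string).itertext())
    # str.split() with no argument splits on any run of spaces, tabs,
    # and newlines, and drops leading and trailing white space
    return " ".join(text.split())


print(normalized_text(multi_line))
print(normalized_text(multi_line) == normalized_text(one_line))
```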

XML allows for other rules to be added, if an individual or group so wishes. These rules, called schemas, can allow great flexibility or be very strict. The TAN schemas tend to the latter.

XML allows users to develop vocabularies of elements as they wish. One person may wish to use <bank> to refer to financial institutions, another to rivers. XML was designed to allow users to mix vocabularies, even when those vocabularies use identical element names. This means that anyone using <bank> must be able to specify exactly whose meaning of <bank> is intended. Disambiguation is accomplished by associating IRIs (see the section called “Identifiers and Their Use” below) with the element names. The actual full name of an element is the local name plus the IRI that qualifies its meaning, e.g., bank{} and bank{}.

The relationship between the element name and the IRI is analogous to that between a person's given name and family name. The IRI—the family name—is called the namespace. If the term sounds like meaningless jargon, you may find it easier to think of it as the name of a group of elements.

Namespace declarations look a lot like attributes (they aren't). They take the form <bank xmlns="">...</bank>, which states, in effect, not only which namespace governs <bank>, but also what the default namespace will be for any descendants.

But supposing we wished to combine the two types of <bank> elements, we can assign abbreviations to select namespaces, then prefix those abbreviations to the element names, separated by a colon. Here are three ways to say the same thing, showing the use of prefix abbreviations and default namespaces:

<bank xmlns="">
    <bank xmlns="">

<bank xmlns="" xmlns:e2="">
    <e2:bank >

<e1:bank xmlns:e1="" xmlns:e2="">
    <e2:bank >
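To see that such variants really say the same thing, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The two namespace IRIs are hypothetical, invented for this sketch; they stand in for whatever IRIs the two vocabularies actually use.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace IRIs, invented for this sketch
FINANCE = "http://example.org/ns/finance"
RIVERS = "http://example.org/ns/rivers"

variants = [
    # default namespace redeclared on the child
    '<bank xmlns="%s"><bank xmlns="%s"/></bank>' % (FINANCE, RIVERS),
    # default namespace on the parent, prefix on the child
    '<bank xmlns="%s" xmlns:e2="%s"><e2:bank/></bank>' % (FINANCE, RIVERS),
    # prefixes on both parent and child
    '<e1:bank xmlns:e1="%s" xmlns:e2="%s"><e2:bank/></e1:bank>' % (FINANCE, RIVERS),
]

full_names = []
for xml_string in variants:
    root = ET.fromstring(xml_string)
    # ElementTree reports full names in the form {namespace}local-name
    full_names.append([element.tag for element in root.iter()])
    print(full_names[-1])
```

All three variants yield the same pair of full names, which is why the choice between prefixes and default namespaces is a matter of convenience.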

The TAN namespace is,2015:ns. The recommended prefix is tan. The namespace is expected to remain the same from one version to the next.

Any TAN-T module can be easily cast into a TEI file, although much of the computer-actionable semantics will be lost in the process. Likewise, a TEI file can be converted to TAN-T, but there is a greater risk of loss of content, since the TAN format is intentionally restricted to an important but small subset of TEI tags.

The TAN-TEI module is a TEI extension to the format, based on an ODD file that is in the same directory as the rest of the schemas. TAN-TEI schemas are generated on the basis of the official TEI All schema that is available at the time of release.

For more about the strictures placed upon the TEI All schema see the section called “Transcriptions Using the Text Encoding Initiative (<TEI>)”. See also Chapter 4, Patterns and Structures Common to All TAN Encoding Formats and Chapter 5, Class-1 TAN Files, Representations of Textual Objects (Scripta).

XML files admit of a process called validation, which checks to see if all the declared rules have been followed. These validation rules are kept in files called schemas, plain-text files that declare the rules of the format. Each TAN file is validated by two types of schema files, one dealing with major rules concerning structure and data type (written in RELAX-NG), the other with very detailed rules (written in Schematron).

When a version of TAN is published, minor updates may be published tacitly, but they will endeavor not to render formerly valid files invalid.

TAN has been revised frequently and deeply as new use cases and examples have presented themselves. Indeed, the development of TAN has resulted in insight into theories of editing and text. Because of this dialectical process, there is no guarantee that one version will be compatible with any other. After the format matures, there may be a guarantee of cross- or backward-compatibility.

Being written purely in XML technologies, TAN adopts their data types, e.g., strings, booleans, and so forth. The official specifications are made by the W3C. The following data types require some special comments.

The acronyms for identifiers, and the meanings of those acronyms, can be mystifying. Here is a quick guide:

Identifiers are used in many contexts for many purposes. One of the key purposes close to those of TAN involves what is called variously Linked Open Data (LOD) or the Semantic Web. These technologies rely upon a very simple data model called Resource Description Framework (RDF), a family of World Wide Web Consortium (W3C) specifications originally designed as a data model for metadata. The foundation of the model is the concept of a statement, made of three parts: subject, predicate, and object. Subjects and predicates take identifiers that act as names of things; the object may likewise be an identifier, or it may be a literal value with a data type. The concept of LOD is that if we use URLs as identifiers for things, then we can create web pages at those URLs that provide human and computer readers with related information. And as we begin to use the same URLs for the same concepts, independently created datasets can be combined and compared by a computer, allowing inferences that might not otherwise be possible.
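The statement model, and the benefit of shared identifiers, can be sketched with plain Python tuples and sets. This is only a toy illustration, not an RDF library, and every URL below is a hypothetical identifier invented for the sketch.

```python
# Each statement is a (subject, predicate, object) triple.
# All URLs below are hypothetical identifiers invented for this sketch.
PERSON = "http://example.org/person/dsmith"
NAME = "http://example.org/vocab/name"
WROTE = "http://example.org/vocab/wrote"
BOOK = "http://example.org/book/1"

dataset_a = {(PERSON, NAME, "Dave Smith")}  # created by one project
dataset_b = {(PERSON, WROTE, BOOK)}         # created independently

# Because both datasets use the same identifier for the same person,
# a plain set union is enough to combine them.
merged = dataset_a | dataset_b

# A simple inference neither dataset supports alone:
# the names of everyone who wrote BOOK
authors = {s for (s, p, o) in merged if p == WROTE and o == BOOK}
names = sorted(o for (s, p, o) in merged if p == NAME and s in authors)
print(names)
```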

These URL identifiers only look like a web page address (e.g., http://...), but are fundamentally a name for something. Ideally, those URLs will still name those things after the domain name expires and the web resource cannot be found. But ordinary users may be forgiven for not knowing whether the URL is a web page or a name for something else.

Many parts of TAN map nicely onto RDF and vice versa. In fact, TAN tends to be easier for humans to read and write than does RDF. For example, this snippet (taken from, written in Turtle syntax, one of the easiest RDF syntaxes, ...

@prefix rdf: <> .
@prefix foaf: <> .
<>
    rdf:type foaf:Person ;
    foaf:name "Dave Smith" .

...has this TAN equivalent:

<person xml:id="dsmith">
   <name>Dave Smith</name>
</person>

In cases such as this, TAN and RDF are equivalent. In more advanced statements, that is not the case, since TAN was designed to represent assertions and claims made by scholars for other scholars. For more details see the section called “Claims and assertions (TAN-c)”.

More reading:

TAN files make extensive use of tag URNs. In fact, TAN's namespace is a tag URN (see the section called “Namespaces”). A tag URN has two parts:

  1. Namespace. tag: + an e-mail address or domain name owned by the person or organization that has authorized the creation of the TAN file + , + an arbitrary day on which that address or domain name was owned. The day is expressed in the form YYYY-MM-DD, YYYY-MM, or YYYY. A missing MM or DD is implicitly assigned the value of 01.

  2. Name of the TAN file. : + an arbitrary string (unique to the namespace chosen) chosen by the namespace owner as a label for the entire file and related versions. It need not be the same as the filename stored on a local directory. You should pick a name that is at least somewhat intelligible to human readers.
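The two parts can be picked apart with a short sketch. The pattern below is a simplification written for this illustration (RFC 4151 gives the authoritative grammar), the function name parse_tag_urn is hypothetical, and the sample URNs use the reserved example.org and example.com domains rather than any real TAN file.

```python
import re
from datetime import date

# Simplified pattern for illustration; not the full RFC 4151 grammar
TAG_URN = re.compile(
    r"^tag:(?P<authority>[^,:\s]+),"
    r"(?P<year>\d{4})(?:-(?P<month>\d{2}))?(?:-(?P<day>\d{2}))?"
    r":(?P<name>\S+)$"
)


def parse_tag_urn(urn):
    """Split a tag URN into authority, date, and name (illustrative only)."""
    match = TAG_URN.match(urn)
    if match is None:
        raise ValueError("not a tag URN: %r" % urn)
    # A missing MM or DD is implicitly assigned the value 01
    day_owned = date(
        int(match.group("year")),
        int(match.group("month") or 1),
        int(match.group("day") or 1),
    )
    return match.group("authority"), day_owned, match.group("name")


print(parse_tag_urn("tag:example.org,2015:my-transcription"))
print(parse_tag_urn("tag:jane@example.com,2015-03:my-transcription"))
```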

Tag URNs are required for labeling and identifying TAN files for a number of reasons, foremost of which is their suitability for enduring naming schemes. That is, they will remain valid centuries from now, long after the death of the owner of a domain name or email address or if those accounts pass into the hands of others. The TAN format requires every piece of data to be attributable to someone (a person, organization, or some other agent) and tag URNs facilitate that requirement in ways that other URNs cannot. Further, tag URNs allow anyone to name things uniquely. You do not need to register your tag URN or have a website that explains the nomenclature (a requirement, sometimes daunting, in http-based IRIs for linked data).

Great care must be taken in choosing the IRI name, because you are the sole guarantor of its uniqueness. It is permissible for something to have multiple IRIs, but never acceptable for an IRI to name more than one thing. It is a good practice to keep a master checklist of IRI names you have created. If you find yourself forgetting, or think you run the risk of creating duplicate IRI names, you should start afresh by creating a new namespace for your tag URNs, easily done just by changing the date in the tag URN namespace. That is, if,2015:... seems to be overly cluttered, you may start a new set of names with something else, e.g.,,2015-01-02:....

The TAN encoding format has chosen tag URNs over URLs for several reasons:

Further reading:

  • RFC 4151, the official definition of tag URNs

Regular expressions are patterns for searching text. The term regular here does not mean ordinary. Rather, it means rules (Latin regula), and points to a rule-based syntax that provides expressive power in algorithms that search and replace text. Regular expressions come in different flavors, and have several layers of complexity. These guidelines are therefore restricted to a synopsis that illustrates very common uses that conform to the definition of regular expressions found in the recommendation of XSLT 3.0 (XML Schema Datatypes plus some extensions), and outlined in XPath Functions 3.0.


XML Schema Datatypes define regular expressions differently than does Perl, one of the most common forms of regular expression. For example, the pipe symbol, |, is treated as a word character in XML regular expressions, \w, but the opposite is true for Perl. For convenience, here is how codepoints U+0020..U+00FF divide between word and non-word categories according to XML (and therefore TAN):

Word characters (\w): $ + 0 1 2 3 4 5 6 7 8 9 < = > A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ^ ` a b c d e f g h i j k l m n o p q r s t u v w x y z | ~ ¢ £ ¤ ¥ ¦ ¨ © ª ¬ ® ¯ ° ± ² ³ ´ µ ¸ ¹ º ¼ ½ ¾ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Non-word characters (\W): ! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { } ¡ § « ­ ¶ · » ¿

Some of these choices may seem counterintuitive or wrong. But at this point it does not matter. The distinction is a legacy that will remain in place. It is advisable to familiarize yourself with decisions that, in some respect, are arbitrary.
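The division can be reproduced with a short sketch. XML Schema Datatypes define \w as every character except those in the Unicode general categories P (punctuation), Z (separators), and C (other); the helper name is_xml_word_char is hypothetical. Python's own re module follows the Perl-like convention, so it serves here as the point of contrast.

```python
import re
import unicodedata


def is_xml_word_char(char):
    """True if char counts as a word character under the XML definition.

    Hypothetical helper: anything outside the general categories
    P (punctuation), Z (separator), and C (other) is a word character.
    """
    return unicodedata.category(char)[0] not in "PZC"


for char in "|_$-":
    # Python's re module uses a Perl-like definition of \w
    perl_like = re.fullmatch(r"\w", char) is not None
    print(repr(char), "XML:", is_xml_word_char(char), "Perl-like:", perl_like)
```

The pipe and the underscore come out on opposite sides in the two definitions, which is the kind of difference to watch for when adapting patterns written for other tools.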

A regular expression search pattern is treated just like a conventional search pattern until the computer reaches one of the following special characters: . [ ] \ | - ^ $ ? * + { } ( ). Here is a brief key to how characters behave in regular expressions, provided they are not in square brackets (on which see the recommended reading below):

Some examples:

The examples above provide a taste of how regular expressions are constructed and read. For further examples especially relevant to TAN see <filter>.

Warning: Regular Expressions and Combining Characters

Regular expressions come in many different flavors, and each one deals with some of the more complex issues in Unicode in its own manner. This inconsistency will be most keenly felt in the use of combining characters in Unicode. Given a string &#x61;&#x301;&#x62; = áb (i.e., an acute accent over the a), the search pattern a. will match the b in some regular expression engines but not in others.

Unicode has differentiated three levels of support for regular expressions (see official report). Only level-one conformance is guaranteed in TAN. Combining characters fall in level two. If you find the need to count characters, and you are working with a language that uses combining characters, you should count only base characters, not combining ones. In fact, TAN assumes that in cases where characters are identified with a numeral, the numeral excludes combining characters. See the section called “Combining characters”. Further, regular expressions with wildcard characters cannot be expected to be treated uniformly across processors.
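A minimal sketch of both points, using Python's standard re and unicodedata modules. Python is only one engine among many, so the wildcard result shown here is exactly the kind of behavior that may differ elsewhere. The sketch also assumes that "combining character" means any character in the Unicode general category M (marks); TAN's own definition may be narrower or broader.

```python
import re
import unicodedata

text = "a\u0301b"  # a + COMBINING ACUTE ACCENT + b

# Python's re module treats the combining accent as a character of its
# own, so the wildcard consumes the accent rather than the b.
matched = re.match(r"a.", text).group()
print(ascii(matched))

# Count only base characters, skipping combining ones
# (assumption: combining = general category M)
base_count = sum(
    1 for char in text if not unicodedata.category(char).startswith("M")
)
print(len(text), base_count)
```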

Further reading: