Core Technology


Unicode

What is it?

Unicode is the worldwide standard for the consistent encoding, representation, and exchange of digital texts. Stable but still growing, Unicode is intended to represent all the world's writing systems, living and historical. Maintained by a nonprofit organization, Unicode is the basis upon which we can create and edit text in mixed alphabets and reliably share that data with other people, independent of individual fonts. Any Unicode-compliant text is in general semantically interoperable on the character level and can be exchanged between users and systems, no matter what font might be used to display the text. If some software tries to display some Unicode-compliant text in a particular font that does not support a particular alphabet, and ends up displaying boxes, the underlying data is still intact and valid. Styling the text with a font that does support the alphabet will reveal this to be the case.

With more than 128,000 characters, Unicode is almost as complex as human writing itself. The entire sequence of characters is divided into blocks, each one reserved, more or less, for a particular alphabet or a set of characters that share something in common. Within each block, characters may be grouped further. Each character is assigned a single codepoint.

Because computers work on the binary system, it was considered ideal to number the characters or glyphs in Unicode with a related numeration system. Codepoints are therefore numbered according to a hexadecimal system (base 16), which uses the digits 0 through 9 and the letters A through F. (The number 10 in decimal is A in hexadecimal; decimal 11 = hex B; decimal 16 = hex 10; decimal 79 = hex 4F.) To find Unicode codepoint values it is therefore helpful to think of the corpus of glyphs as a very long ribbon sixteen squares wide. This is illustrated nicely in this article. Each position along the width is labeled with a hexadecimal digit (0-9, A-F) that always identifies the last digit of a character's codepoint value.
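The arithmetic can be checked with a few lines of Python. This is only an illustrative sketch; the function name ribbon_position is invented here and is not part of TAN or Unicode.

```python
def ribbon_position(codepoint):
    """Return (row, column) of a codepoint on a ribbon sixteen squares wide."""
    # The column is the last hexadecimal digit; the row is everything before it.
    return divmod(codepoint, 16)

# Decimal values and their hexadecimal equivalents.
for decimal in (10, 11, 16, 79):
    print(decimal, "=", format(decimal, "X"))

row, column = ribbon_position(0x044E)
print("U+044E sits in row {:X}, column {:X}".format(row, column))
```

The column is always the final hexadecimal digit of the codepoint, which is why the ribbon metaphor makes values easy to look up.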

It is common to refer to Unicode characters by their value or their name. The value customarily starts "U+" and continues with the hexadecimal value, usually at least four digits. The official Unicode name is usually given fully in uppercase. Examples:

Table 3.1. Unicode characters

Character	Unicode value	Unicode name
" " (space)	U+0020	SPACE
®	U+00AE	REGISTERED SIGN
ю	U+044E	CYRILLIC SMALL LETTER YU
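The same labels can be produced programmatically. The sketch below uses Python's standard unicodedata module; the helper name describe is an assumption made for illustration.

```python
import unicodedata

def describe(char):
    """Return the customary 'U+XXXX NAME' label for a single character."""
    return "U+{:04X} {}".format(ord(char), unicodedata.name(char))

# The three characters from Table 3.1: space, registered sign, Cyrillic yu.
for char in (" ", "\u00ae", "\u044e"):
    print(describe(char))
```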

Normalization

TAN validation rules require all data to be normalized according to the Unicode NFC algorithm. Any text in a TAN body that does not comply will be marked as invalid. Validation engines that support Schematron Quick Fixes will allow users to easily convert non-normalized to normalized Unicode.
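Outside a validation engine, NFC compliance can be tested with any Unicode-aware language. The following Python sketch is not part of the TAN function library; the name is_nfc is hypothetical.

```python
import unicodedata

def is_nfc(text):
    """True if the text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text

decomposed = "e\u0301"                               # e + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single codepoint U+00E9

print(is_nfc(decomposed), is_nfc(composed))
```

Both strings look identical on screen, which is why normalization problems are best caught by a tool rather than by eye.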

Unicode characters with special interpretation

The TAN format allows the following characters anywhere, but assigns them special meaning in certain contexts:

U+200D ZERO WIDTH JOINER
U+00AD SOFT HYPHEN

When these characters occur at the end of a leaf <div>, perhaps followed by white space that will be ignored (see below), processors will assume that the character is to be deleted and that, when the <div> is combined with the next leaf <div>, no intervening space should be allowed. Furthermore, because these characters are difficult to discern from spaces and hyphens, any output based on the character mapping of the core functions should replace these characters with their XML character references, &#x200D; and &#xAD;.
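The joining rule can be sketched as follows. This is an illustration under stated assumptions, not the TAN implementation: the function name join_leaf_divs is invented, leaf <div> contents are passed in as plain strings, and a single space is assumed as the default separator.

```python
import re

JOINERS = "\u200d\u00ad"  # ZERO WIDTH JOINER, SOFT HYPHEN

def join_leaf_divs(texts):
    """Concatenate leaf-div texts, honoring a trailing joiner or soft hyphen."""
    parts = []
    glue = ""  # separator to place before the next div
    for text in texts:
        # Space-normalize first, so trailing white space is ignored.
        normalized = re.sub(r"[ \t\n\r]+", " ", text).strip(" ")
        if normalized and normalized[-1] in JOINERS:
            parts.append(glue + normalized[:-1])  # delete the special character
            glue = ""                             # and allow no intervening space
        else:
            parts.append(glue + normalized)
            glue = " "                            # assumed default separator
    return "".join(parts)

print(join_leaf_divs(["inter\u00ad \n", "national law", "applies"]))
```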

Combining characters

At the core level of conformance, Unicode does not dictate whether combining characters (accents, modifying symbols) should be counted independently or as part of a base character, nor does the family of XML languages. In most circumstances, this point is negligible. But it affects regular expressions and XPath expressions (see below).

Two of the class 2 formats allow the counting of characters. Such counting is assumed to be made exclusively of non-combining characters, defined as the regular expression [^\p{M}]. Any numerical reference made in a TAN file to an individual character will be found by counting only non-combining characters, and will return that base character combined with all combining characters that immediately follow. Any <div> that starts with a combining character will be marked as invalid. See also Regular Expressions and Combining Characters.
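A minimal Python sketch of this counting rule follows. The function names are hypothetical, and Unicode general categories beginning with M are used as the equivalent of \p{M}.

```python
import unicodedata

def split_characters(text):
    """Group each base character with the combining characters that follow it."""
    groups = []
    for char in text:
        if unicodedata.category(char).startswith("M") and groups:
            groups[-1] += char   # attach the combining character to its base
        else:
            # A leading combining character lands here as its own group;
            # TAN would mark such a <div> as invalid.
            groups.append(char)
    return groups

def nth_character(text, n):
    """Return the n-th character (1-based), counting only non-combining ones."""
    return split_characters(text)[n - 1]

sample = "a\u0301bc"  # a + COMBINING ACUTE ACCENT, b, c
print(len(split_characters(sample)), nth_character(sample, 2))
```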

Deprecated Unicode points

Because TAN is not concerned with appearance, the following characters will generate an error if found in a TAN file:

U+00A0 NO-BREAK SPACE
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
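A file can be screened for these codepoints before validation. The Python sketch below is only an illustration; the names DEPRECATED and find_deprecated are assumptions, not part of TAN.

```python
# U+00A0 plus the contiguous range U+2000..U+200A listed above.
DEPRECATED = {0x00A0} | set(range(0x2000, 0x200B))

def find_deprecated(text):
    """Return (position, codepoint label) for every deprecated character."""
    return [
        (index, "U+{:04X}".format(ord(char)))
        for index, char in enumerate(text)
        if ord(char) in DEPRECATED
    ]

print(find_deprecated("a\u00a0b\u2009c"))
```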

eXtensible Markup Language (XML)

What is it?

Defined by the W3C, the eXtensible Markup Language (XML) is a machine-actionable markup language that facilitates human readability.

At its heart, XML is rather simple. It begins with an opening line that declares that what otherwise would look just like plain text is an XML file. It then proceeds to the data, which must be marked by one or more pairs of tags. An opening tag looks like <tag> and a closing tag like </tag> (or if the tags contain no data, the pair can be collapsed into one: <tag/>). A pair of matching tags is called an element. Elements must nest within each other. They cannot overlap. For example:

<?xml version="1.0" encoding="UTF-8"?>
<p>A paragraph about 
  <name>
    <first>Mary</first> 
    <last>Lee</last></name>.</p>

This nesting relationship of elements means that an XML document can be pictured as a tree, a metaphor that provides a host of technical names for the relationships that hold between elements: root, parent, child, sibling, ancestor, and descendant. In the example above, the root element <p> is the parent of <name> and the ancestor of <name>, <first>, and <last>. The element <first> is a child of <name> and a descendant of both <name> and <p>. <first> and <last> are siblings to each other.
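These relationships can be explored with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module on the example above (the XML declaration is omitted because the document is passed in as a string); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

DOC = """<p>A paragraph about
  <name>
    <first>Mary</first>
    <last>Lee</last></name>.</p>"""

root = ET.fromstring(DOC)                          # the root element <p>
name = root.find("name")                           # a child of <p>
children = [child.tag for child in name]           # children of <name>
descendants = [el.tag for el in root.iter()][1:]   # every element below <p>

print(root.tag, children, descendants)
```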

The opening tag of an element might have additional nodes called attributes, recognized by a word, an equals sign, and then some text within quotation marks (single or double), e.g., id="self". An element may have many attributes, and those attributes can appear in any order. Attributes can be thought of as leaves on an XML tree. They are intended to carry simple data (usually metadata about the data contained by the element), because they cannot govern anything else.

<?xml version="1.0" encoding="UTF-8"?>
<p n="1" id="example">A paragraph about <name><first>Mary</first> <last>Lee</last></name>.</p>

Apart from the attributes added to <p> in the second, the two examples above are functionally equivalent, even though the first takes up several lines and the second only two. That is because in most XML projects extra lines, spaces, and indentation are effectively ignored by processors, to give human editors the flexibility they need to optimize indentation for readability. Therefore, continuous strings of multiple spaces, tabs, newlines, and carriage returns are to be treated as a single space. (See below.)

XML allows for other rules to be added, if an individual or group so wishes. These rules, called schemas, can allow great flexibility or be very strict. The TAN schemas tend to the latter.

Schemas and validation

Validation files are found here: http://textalign.net/release/TAN-1-dev/schemas/.

Each TAN file is validated by two types of schema files, one dealing with major rules concerning structure and data type (written in RELAX-NG), the other with very detailed rules (written in Schematron).

The RELAX-NG rules are written primarily in compact syntax (.rnc), and converted to the XML syntax (.rng). For TAN-TEI, the special format One Document Does it All (.odd) is used to alter the rules for TEI All.

The Schematron files are generally quite simple, acting as a conduit to a large function library written in XSLT. For more on this process, see the section called “Doing Things with TAN Files (Stylesheets and the Function Library)”.

Some validation engines that process a valid TAN-compliant TEI file may return an error something like conflicting ID-types for attribute "who" of element "comment" from namespace "tag:textalign.net,2015:ns". Such a message alerts you to the fact that by mixing TEI and TAN namespaces, you open yourself up to the possibility of conflicting xml:id values. It is your responsibility to ensure that you have not assigned duplicate identifiers. Very often, it is possible for you to configure an XML editor to ignore this discrepancy. (In oXygen XML editor go to Options > Preferences... > XML > XML Parser > RELAX NG and uncheck the box ID/IDREF.)

White space

In any XML file, unless otherwise specified, consecutive space characters (space, tab, newline, and carriage return) are considered equivalent to a single space. This gives editors the freedom they need to format XML documents as they like, for either human readability or compactness.

All TAN formats assume data will be pre-processed with space normalization, as defined by the standard XML function fn:normalize-space(), which trims space from the beginning and end of a text node or string, and replaces consecutive space characters with a single space. Some space is assumed to exist between adjacent leaf <div>s, even if no space intervenes (unless the first <div> ends in the soft hyphen or the zero width joiner; see the section called “Unicode characters with special interpretation”). What type of space is not dictated by the TAN format. It is up to processors to analyze the relevant <div-type> to interpret what kind of white-space separator is appropriate.
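For readers working outside XPath, the behavior of fn:normalize-space() can be approximated as in the Python sketch below. The function name normalize_space is an assumption; only the four XML space characters are treated as space, so other Unicode spaces pass through untouched.

```python
import re

def normalize_space(text):
    """Approximate fn:normalize-space(): collapse runs of space, then trim."""
    # XML white space is limited to space, tab, newline, and carriage return.
    return re.sub(r"[ \t\n\r]+", " ", text).strip(" ")

print(repr(normalize_space("  A paragraph\n\tabout   Mary \r\n")))
```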

If retention of multiple spaces is important for your research, then TAN formats may not be an appropriate format, since TAN is not intended to replicate the appearance of a scriptum. Pure TEI (and not TAN-TEI) might be a practical alternative, since it allows for a literal use of space, and encourages XML files that try to replicate the appearance of a scriptum.

For more on white space see the W3C recommendation.

Non-mixed content

Many familiar text formats such as TEI, HTML, and Docbook allow what is called mixed content, i.e., elements and nonspace text nodes may be combined as siblings. The TAN formats, aside from TAN-TEI, are committed to a non-mixed content model. Nonspace text nodes and elements are never siblings. The practical effect of this policy is that indentation may be applied to a TAN file as one wishes, and space text nodes may be inserted between any two adjacent elements, without affecting the meaning.

To specify in a class 1 file that two adjacent leaf <div>s should have no intervening space, see the section called “Unicode characters with special interpretation”.

Namespaces

What are they?

XML allows users to develop vocabularies of elements as they wish. One person may wish to use the element <bank> to refer to financial institutions, another to rivers. Perhaps someone wishes to mention both rivers and financial institutions in the same document. XML was designed to allow users to mix vocabularies, even when those vocabularies use identical element names. This means that anyone using <bank> must be able to specify exactly whose vocabulary is being used. Disambiguation is accomplished by associating IRIs (see the section called “Identifiers and Their Use” below) with the element names. The actual full name of an element is the local name plus the IRI that qualifies its meaning, e.g., bank{http://example1.com/terms/} and bank{http://example2.com/terms/}.

The relationship between the element name and the IRI is analogous to that between a person's given name and family name. The IRI—the family name—is called the namespace. If the term sounds like meaningless jargon, you may find it easier to think of it as the name of a group of elements.

Namespace declarations look a lot like attributes (they aren't). They take the form <bank xmlns="http://example1.com/terms/">...</bank>, which states, in effect, not only which namespace governs <bank>, but also what the default namespace will be for any descendants.

But supposing we wished to combine the two types of <bank> elements, we can assign abbreviations to select namespaces, then prepend those abbreviations to the element names, separated by a colon. Here are three ways to say the same thing, showing the use of prefix abbreviations and default namespaces:

<bank xmlns="http://example1.com/terms/">
    <bank xmlns="http://example2.com/terms/">
        ...
    </bank>
</bank>

<bank xmlns="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">
    <e2:bank >
        ...
    </e2:bank>
</bank>

<e1:bank xmlns:e1="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">
    <e2:bank >
        ...
    </e2:bank>
</e1:bank>
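That the three forms are equivalent can be confirmed with a namespace-aware parser. The sketch below uses Python's standard xml.etree.ElementTree module, which reports each element's full name as {namespace}local-name; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

FORMS = [
    '<bank xmlns="http://example1.com/terms/">'
    '<bank xmlns="http://example2.com/terms/">...</bank></bank>',
    '<bank xmlns="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">'
    '<e2:bank>...</e2:bank></bank>',
    '<e1:bank xmlns:e1="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">'
    '<e2:bank>...</e2:bank></e1:bank>',
]

# Collect the full (outer, inner) element names for each form.
names = []
for form in FORMS:
    outer = ET.fromstring(form)
    names.append((outer.tag, outer[0].tag))

for pair in names:
    print(pair)
```

All three lines of output are identical, because prefixes and default declarations are only different spellings of the same namespace-qualified names.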

TAN namespace and prefix

The TAN namespace is tag:textalign.net,2015:ns. The recommended prefix is tan. The namespace is expected to remain the same from one version to the next.

The TAN-TEI format uses as its default the TEI namespace, http://www.tei-c.org/ns/1.0, normally given the prefix tei.

The Text Encoding Initiative

What is it?

The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. Since 1994, the TEI Guidelines have been widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation. In addition to the Guidelines themselves, the Consortium provides a variety of resources and training events for learning TEI, information on projects using the TEI, a bibliography of TEI-related publications, and software developed for or adapted to the TEI.

	Note
	Taken from the TEI website http://www.tei-c.org/index.xml, accessed 2017-05-21.

Any TAN-T module can be easily cast into a TEI file, although much of the computer-actionable semantics will be lost in the process. Likewise, a TEI file can be converted to TAN-T, but there is a greater risk of loss of content, particularly in the header, since the TAN format is intentionally restricted to an important but small subset of TEI tags.

The TAN-TEI module is a TEI extension to the format, based on an ODD file that is in the same directory as the rest of the schemas. TAN-TEI schemas are generated on the basis of the official TEI All schema that is available at the time of release.

For more about the strictures placed upon the TEI All schema see the section called “Transcriptions Using the Text Encoding Initiative (<TEI>)”. See also Chapter 4, Patterns and Structures Common to All TAN Encoding Formats and Chapter 5, Class-1 TAN Files, Representations of Textual Objects (Scripta).

Data types

Being written purely in XML technologies, TAN adopts its data types, e.g., strings, booleans, and so forth, from the official specifications made by the W3C. The following data types require some special comments.

Languages

TAN adopts for language identification Best Current Practice (BCP) 47, which standardizes with high precision the way languages are identified. For most users of TAN, this will be a simple three-letter abbreviation, sometimes supplemented with a hyphen and a subtag designating a script or region. For example, eng, eng-GB, and eng-Cyrl-GB refer, respectively, to English generally, English from the United Kingdom, and English from the United Kingdom written in the Cyrillic script. As a general rule, values of this type should begin with a three-letter language code, preferably lowercase.

ISO codes for human languages appear in @xml:lang and <for-lang>. The first indicates the principal language of the text enclosed by the parent element. The second indicates that some statement or claim is being made about a specific language. For example, <for-lang> in the context of a TAN-mor file indicates languages for which the encoded morphological rules are appropriate.

For more information, see one of the following:

BCP 47 official specifications
BCP 47 technical details

Dates and times

TAN adopts the standardized ISO form of dates and date-times, as interpreted by XML data types. These begin with years (the largest unit) and end with days, seconds, or fractions of seconds (the smallest). This standard allows for easy sorting.

The simplest date takes this form: YYYY-MM-DD. If a time is included, it is specified by continuing the string, first with a T (for time), then the form hh:mm:ss.sss(Z|[-+]hh:mm). For example, 2016-09-20T20:38:27.141-04:00 is an ISO date-time for Tuesday, September 20, 2016 at 8:38 p.m. in the Eastern Time Zone.
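Most programming languages read this form directly. The Python sketch below parses the example date-time with the standard datetime module and shows that plain alphabetical sorting of ISO dates is also chronological; the sample dates in the list are invented for illustration.

```python
from datetime import datetime, timedelta

stamp = datetime.fromisoformat("2016-09-20T20:38:27.141-04:00")
print(stamp.year, stamp.month, stamp.day, stamp.hour, stamp.utcoffset())

# Because the largest unit comes first, string order equals date order.
dates = ["2016-09-20", "2015-12-31", "2016-01-05"]
print(sorted(dates))
```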

Identifiers and Their Use

The acronyms for identifiers, and the meanings of those acronyms, can be mystifying. Here is a synopsis:

IRI: Internationalized Resource Identifier, a generalization of the URI system, allowing the use of Unicode; defined by RFC 3987
URI: Uniform Resource Identifier, a string of characters used to identify a name or a resource; defined by RFC 3986
URL: Uniform Resource Locator, a URI that identifies a Web resource and the communication protocol for retrieving the resource.
URN: Uniform Resource Name, a term that originally referred to persistent names using the urn: scheme, but is now applied to a variety of systems that have registered with the IANA. URNs are generally best thought of as a subset of URIs.
UUID: Universally Unique Identifier, a computer-generated 128-bit number used to assign identifiers to any entity. UUIDs can be built into a URN by prefixing them with urn:uuid:.
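As a brief illustration of the last item, Python's standard uuid module exposes the URN form of a UUID directly. The UUID value below is an arbitrary sample.

```python
import uuid

identifier = uuid.UUID("12345678-1234-5678-1234-567812345678")
print(identifier.urn)

# A fresh random UUID can be minted the same way.
print(uuid.uuid4().urn)
```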

The TAN format generally prefers to refer to IRIs.

See also the section called “Tag URNs”.

Resource Description Framework (RDF) and Linked Open Data

What are they?

Identifiers are used in many contexts for many purposes. One of the key purposes close to those of TAN involves what is called variously Linked Open Data (LOD) or the Semantic Web. These technologies rely upon a very simple data model called Resource Description Framework (RDF), a family of World Wide Web Consortium (W3C) specifications originally designed as a data model for metadata. The foundation of the model is the concept of a statement, made of three parts: subject, predicate, and object. Subjects and predicates take identifiers that act as names of things, as does the object, which also allows for data type. The practical impetus to LOD is that if we use URLs as identifiers for things, then we can create web pages at those URLs that provide humans and computers with related, linked information. And as we begin to use the same URLs for the same concepts, then independently created datasets can be combined and compared into a whole that admits inferences not possible with the parts alone.

These URL identifiers look like a web page address (e.g., http://...), but are first and foremost names for things (the "Resource" behind RDF is a clumsy term pointing to person, place, concept—anything at all). Ideally, those URLs will still name those things after the domain name expires and the web resource cannot be found. But ordinary users may be forgiven for not knowing whether the URL is a web page or a name for something else.

TAN and RDF

Many parts of TAN map nicely onto RDF and vice versa. In fact, TAN tends to be easier for humans to read and write than does RDF, even in its most straightforward syntax. Compare, for example, this snippet (taken from http://linkeddatabook.com/editions/1.0/), written in Turtle syntax, ...

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://biglynx.co.uk/people/dave-smith>
    rdf:type foaf:Person ;
    foaf:name "Dave Smith" .

...with the TAN equivalent:

<person xml:id="dsmith">
   <IRI>http://biglynx.co.uk/people/dave-smith</IRI>
   <name>Dave Smith</name>
</person>

In this case TAN and RDF are converted losslessly. But in many cases, TAN statements cannot be reduced to the RDF model. This happens most often in the context of <claim>, which is designed to allow scholarly assertions and claims that are difficult or impossible to express in RDF. For example, RDF does not allow one to say "Person X is not the author of text Y." TAN claims have been designed specifically to cater to such common scholarly expressions. For more details see the section called “Claims and assertions (TAN-c)”.
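The lossless case can be sketched in Python as a conversion from the TAN snippet to subject-predicate-object triples. This is an illustration only: the function name person_to_triples and the choice to map <name> to foaf:name are assumptions, and the snippet is parsed without the TAN namespace, as it is printed above.

```python
import xml.etree.ElementTree as ET

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"

TAN_PERSON = """<person xml:id="dsmith">
   <IRI>http://biglynx.co.uk/people/dave-smith</IRI>
   <name>Dave Smith</name>
</person>"""

def person_to_triples(xml_text):
    """Turn a <person> element into (subject, predicate, object) triples."""
    person = ET.fromstring(xml_text)
    subject = person.findtext("IRI")  # the first IRI names the person
    triples = [(subject, RDF_TYPE, FOAF + "Person")]
    for name in person.findall("name"):
        triples.append((subject, FOAF + "name", name.text))
    return triples

for triple in person_to_triples(TAN_PERSON):
    print(triple)
```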

Tag URNs

TAN files make extensive use of tag URNs (see the section called “Identifiers and Their Use”). In fact, TAN's namespace is a tag URN (the section called “Namespaces”). A tag URN has two parts:

Namespace. tag: + an e-mail address or domain name owned by the person or organization that has authorized the creation of the TAN file + , + an arbitrary day on which that address or domain name was owned. The day is expressed in the form YYYY-MM-DD, YYYY-MM, or YYYY. A missing MM or DD is implicitly assigned the value of 01.
Name of the TAN file. : + an arbitrary string (unique to the namespace chosen) chosen by the namespace owner as a label for the entire file and related versions. It need not be the same as the filename stored on a local directory. You should pick a name that is at least somewhat intelligible to human readers.

Great care must be taken in choosing the IRI name, because you are the sole guarantor of its uniqueness. It is permissible for something to have multiple IRIs, but never acceptable for an IRI to name more than one thing. It is a good practice to keep a master checklist of IRI names you have created. If you find yourself forgetting, or think you run the risk of creating duplicate IRI names, you should start afresh by creating a new namespace for your tag URNs, easily done just by changing the date in the tag URN namespace. That is, if tag:textalign.net,2015:... seems to be overly cluttered, you may start a new set of names with something else, e.g., tag:textalign.net,2015-01-02:....

Example 3.1. TAN IRI names

tag:jan@example.com,1999-01-31:TAN-T001
tag:example.com,2001-04:hamlet-tan-t
tag:evagriusponticus.net,2014:tan-lm:Evagrius_Praktikos_grc_Guillaumonts
tag:bbrb@example.org,1995-04-01:pos-grc

The first example comes from someone who owned the email address jan@example.com on January 31, 1999 (at the stroke of midnight, Coordinated Universal Time). The other examples follow a similar logic. The namespaces of the second and third examples are tied to the owners of specific domain names, not of email addresses. The 2014 in the third example is shorthand for the first second of January 1, 2014.
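The structure of these names can be made explicit with a small parser. The Python sketch below is not part of TAN; the pattern and the function name parse_tag_urn are assumptions, and the pattern checks only the overall shape described above, not ownership or the full tag specification.

```python
import re
from datetime import date

TAG_URN = re.compile(
    r"^tag:(?P<authority>[^,]+),"
    r"(?P<year>\d{4})(?:-(?P<month>\d{2})(?:-(?P<day>\d{2}))?)?"
    r":(?P<name>.+)$"
)

def parse_tag_urn(urn):
    """Split a tag URN into (authority, date of ownership, name)."""
    match = TAG_URN.match(urn)
    if match is None:
        raise ValueError("not a tag URN: " + urn)
    # A missing MM or DD is implicitly 01.
    owned_on = date(
        int(match["year"]),
        int(match["month"] or 1),
        int(match["day"] or 1),
    )
    return match["authority"], owned_on, match["name"]

print(parse_tag_urn("tag:example.com,2001-04:hamlet-tan-t"))
```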

The TAN encoding format has chosen tag URNs over URLs for several reasons:

Permanence. Authors of TAN data are creating files that are meant to be relevant for decades and centuries in the future, well after specific domain names have changed ownership or fallen into obsolescence, and well after the creators are dead. To mint names according to URLs is inadequate for long-term use, since URLs have no built-in mechanism to identify who owned the domain name in question when the name was minted.
Responsibility. The TAN format requires every piece of data to be attributable to someone (a person, organization, or some other agent). Tag URNs attach the responsibility for naming objects to a particular person or organization that owned the tag namespace at the specified time.
Accessibility. Tag URNs are available to anyone who has an email address. No one has to register with any central authority. You can begin naming anything you want, any time you want, without seeking anyone's approval.
Ease. Tag URNs are easier to use than, say, http-form URLs, as recommended by RDF (see the section called “Resource Description Framework (RDF) and Linked Open Data”). Many potential TAN authors never have owned a domain name, and never will. Further, many of those who do own domain names cannot or do not wish to configure and maintain servers that will administer the referral mechanisms upon which the semantic web depends.
Disambiguation of name and location. In the semantic web, conflation of name with a location to resolve it is considered a virtue because a single string answers two questions: what is the resource and where can I find out more about it. But this conflation is unhelpful for those who use the TAN formats, who are encouraged to distribute their TAN files widely, and not rely upon a single location. And URLs are in common parlance interpreted as locations for data, not as names for things. TAN-compliant tag URNs ensure that the names of concepts and objects do not look like locations, maintaining a distinction that has always been a foundational principle in scholarly citation, namely, that one should always distinguish the name of a resource from where it might be found.

Regular Expressions

Regular expressions are patterns for searching text. The term regular here does not mean ordinary. Rather, it means rules (Latin regula), and points to a rule-based syntax that provides expressive power in algorithms that search and replace text. Regular expressions come in different flavors, and have several layers of complexity. So these guidelines are restricted to a synopsis that illustrates very common uses that conform to the definition of regular expressions found in the recommendation of XSLT 3.0 (XML Schema Datatypes plus some extensions), and outlined in XPath Functions 3.0.

Caution


XML Schema Datatypes defines regular expressions differently than does Perl, one of the most common forms of regular expression. For example, the pipe symbol, |, is treated as a word character in XML regular expressions (\w), but the opposite is true for Perl. For convenience, here is how codepoints U+0020..U+00FF are categorized according to XML (and therefore TAN):

Word characters (\w): $ + 0 1 2 3 4 5 6 7 8 9 < = > A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ^ ` a b c d e f g h i j k l m n o p q r s t u v w x y z | ~ ¢ £ ¤ ¥ ¦ ¨ © ª ¬ ® ¯ ° ± ² ³ ´ µ ¸ ¹ º ¼ ½ ¾ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Non-word characters (\W): ! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { } ¡ § « ¶ · » ¿

Some of these choices may seem counterintuitive or wrong. But at this point it does not matter. The distinction is a legacy that will remain in place. It is advisable to familiarize yourself with decisions that, in some respect, are arbitrary.
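The XML definition of a word character (anything that is not punctuation, a separator, or a control or other character in Unicode's general categories) can be emulated where no XML regular expression engine is at hand. The Python sketch below is an approximation with invented function names; results for a few characters may vary with the version of the Unicode database bundled with the Python interpreter.

```python
import unicodedata
from itertools import groupby

def is_xml_word_char(char):
    """True if char falls outside the Unicode categories P, Z, and C."""
    return unicodedata.category(char)[0] not in "PZC"

def xml_word_runs(text):
    """Return the maximal runs of XML word characters in text."""
    return [
        "".join(run)
        for is_word, run in groupby(text, key=is_xml_word_char)
        if is_word
    ]

# The pipe is a word character; the underscore is not.
print(is_xml_word_char("|"), is_xml_word_char("_"))
print(xml_word_runs("a|b_c $5"))
```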

A regular expression search pattern is treated just like a conventional search pattern until the computer reaches a special escape character: . [ ] \ | - ^ $ ? * + { } ( ). Here is a brief key to how characters behave in regular expressions, provided they are not in square brackets (on which see the recommended reading below):

Table 3.2. Special characters in regular expressions

Symbol	Meaning
$	end of line
.	any character
|	or (union)
^	start of line
?	zero or one
*	zero or more
+	one or more
[ ]	a class of characters
( )	a group
\w	any word character
\W	any nonword character
\s	any of the four standard spacing characters: space (U+0020), tab (U+0009), newline (U+000A), carriage return (U+000D)
\S	anything not a spacing character
\d	any decimal digit (0-9, as well as the decimal digits of other scripts; equivalent to \p{Nd})
\D	anything not a digit
\p{IsGujarati}	any character from the Unicode block named Gujarati
\\	backslash (the backslash alone suggests that the next character is a special character)
\$	dollar sign
\(	opening parenthesis
\[	opening square bracket

Some examples:

Table 3.3. Examples of Regular Expressions

Expression	Meaning	What the expression matches when applied to "Wi-fi, good. A_hem* isn't!"
`^.+$`	one whole line of characters	"Wi-fi, good. A_hem* isn't!"
`[ae]`	a or e	"e"
`[a-e]`	a, b, c, d, or e	"d", "e"
`[^ae]+`	one or more characters that are anything except a or e	"Wi-fi, good. A_h", "m* isn't!"
`.i`	any character followed by i.	"Wi", "fi", " i"
`(.i)`	when a character followed by an i is found treat it as a capture group (used only in a search pattern)	"Wi", "fi", " i"
`$1`	first capture group (used only in a replacement pattern, and corresponds to the sequence of capture groups in the search pattern)	In the example above, each match corresponds to $1
`[aeiou]\w*`	any lowercase vowel along with every word character that follows	"i", "i", "ood", "em", "isn"
`[t*].`	any t or * and the following character	"* ", "t!" Note that the asterisk, if inside a character class, acts as itself.
`\s+`	match one or more space characters	" ", " ", " "
`\w+`	match one or more word characters	"Wi", "fi", "good", "A", "hem", "isn", "t" (the underscore is a nonword character in XML regular expressions)
`\W+`	match one or more nonword characters	"-", ", ", ". ", "_", "* ", "'", "!"
`[^q]+`	one or more characters that are not a q	"Wi-fi, good. A_hem* isn't!"

The examples above provide a taste of how regular expressions are constructed and read. For further examples especially relevant to TAN see <filter>.
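Several of the patterns in Table 3.3 can be tried in other flavors. The Python sketch below uses the standard re module, which is not the XML flavor that TAN relies on: the first four patterns behave the same way in both, the replacement syntax differs (\1 in Python where XPath uses $1), and \w differs in its treatment of the underscore.

```python
import re

SAMPLE = "Wi-fi, good. A_hem* isn't!"

range_hits = re.findall(r"[a-e]", SAMPLE)
negated_hits = re.findall(r"[^ae]+", SAMPLE)
dot_hits = re.findall(r".i", SAMPLE)
class_hits = re.findall(r"[t*].", SAMPLE)

# Python writes the first capture group as \1 where XPath writes $1.
bracketed = re.sub(r"(.i)", r"[\1]", SAMPLE)

# Flavor difference: Python counts the underscore as a word character.
python_words = re.findall(r"\w+", SAMPLE)

print(range_hits, negated_hits, dot_hits, class_hits, sep="\n")
print(bracketed)
print(python_words)
```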


Regular Expressions and Combining Characters

Regular expressions come in many different flavors, and each one deals with some of the more complex issues in Unicode in its own manner. This ambiguity will be most keenly felt in the use of combining characters in Unicode. Given the string áb, encoded as a followed by U+0301 COMBINING ACUTE ACCENT followed by b (i.e., an acute accent over the a), a search pattern a. will in some search engines include the b and in others not.

Unicode has differentiated three levels of support for regular expressions (see official report). Only level one conformance in TAN is guaranteed. Combining characters fall in level two. If you find the need to count characters, and you are working with a language that uses combining characters, you should count only base characters, not combining ones. In fact, TAN assumes that in cases where characters are identified with a numeral, the numeral excludes combining characters. See the section called “Combining characters”. Further, any regular expressions with wildcard characters cannot be expected to be treated uniformly.
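The ambiguity is easy to reproduce. In Python's re module, taken here only as one example of an engine that works codepoint by codepoint, the wildcard consumes the combining accent rather than the b, and the result changes again once the string is normalized to NFC.

```python
import re
import unicodedata

decomposed = "a\u0301b"   # a + COMBINING ACUTE ACCENT + b
composed = unicodedata.normalize("NFC", decomposed)   # U+00E1 + b

# The wildcard matches the combining accent, not the b.
wildcard_match = re.search("a.", decomposed).group()
print([hex(ord(char)) for char in wildcard_match])

# After NFC normalization there is no plain "a" left to match.
print(re.search("a.", composed))
```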

TAN includes several functions that usefully extend XML regular expressions. See tan:regex, tan:matches(), tan:replace(), tan:tokenize().
