Core technology


Unicode

What is it?

Unicode is the worldwide standard for the encoding, representation, and exchange of digital texts. The standard is maintained by a nonprofit consortium whose goal is to represent all the world's writing systems, living and historical. The Unicode standard allows us to share texts in any alphabet, syllabary, or ideographic system reliably, regardless of how that text is rendered (e.g., fonts, display).

With more than 128,000 characters, Unicode is almost as complex as human writing itself. The entire sequence of characters is divided into blocks, each one reserved, more or less, for a particular script or group of characters. Within each block, characters may be grouped further. Each character is assigned a single number called a codepoint.

Codepoints are numbered according to the hexadecimal system (base 16), which uses the digits 0 through 9 and the letters A through F. (The decimal number 10 is hexadecimal A; decimal 11 = hex B; decimal 16 = hex 10; decimal 79 = hex 4F.) It is helpful to think of Unicode as a very long table of sixteen columns, a glyph in each square; this is illustrated nicely in this article.

It is common to refer to Unicode characters by their value and perhaps by their name. The value customarily starts "U+" and continues with the hexadecimal value, usually at least four hexadecimal characters. When the official Unicode name is given, it is normally in uppercase. Examples:

Table 3.1. Unicode characters

Character	Unicode value	Unicode name
" " (space)	U+0020	SPACE
®	U+00AE	REGISTERED SIGN
ю	U+044E	CYRILLIC SMALL LETTER YU
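The notation used in Table 3.1 can be reproduced programmatically. The following is a minimal Python sketch, not part of TAN: the helper name describe is an illustrative assumption, and the names come from the Unicode database bundled with Python's standard library.

```python
import unicodedata

def describe(char):
    """Return the U+ value and the official Unicode name of one character."""
    # Codepoints are customarily written with at least four hex digits.
    return f"U+{ord(char):04X}", unicodedata.name(char)

for char in " ®ю":
    value, name = describe(char)
    print(value, name)

# Hexadecimal and decimal are two spellings of the same number.
assert int("4F", 16) == 79
assert ord("ю") == 0x44E == 1102
```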

In an XML file, nearly any Unicode codepoint may be used, either by typing or pasting the character directly, or by using XML entities. An XML entity is a proxy for some other text, marked by an ampersand, some text, and then a semicolon. For example, &amp; represents the ampersand and &lt; stands for <. To access a specific Unicode character, an entity may start with &#x followed by the hexadecimal codepoint (if you prefer to work with decimal codepoints, leave off the x). For example, the XML hex entity &#x44e; (or &#1102; in decimal) is a proxy for the Cyrillic small letter yu.
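To see how a parser resolves such proxies, here is a small Python sketch using the standard library's ElementTree parser. The element name t is an arbitrary placeholder chosen for this illustration.

```python
import xml.etree.ElementTree as ET

# Hex and decimal references to U+044E, plus the two predefined
# entities for the ampersand and the less-than sign.
doc = "<t>&#x44e; &#1102; &amp; &lt;</t>"

text = ET.fromstring(doc).text
print(text)
```

After parsing, the application sees only the characters themselves; whether the author typed the character or its proxy is no longer visible.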

Unicode normalization

Unicode rules provide guidance on how text should be normalized, to identify equivalent variations. For example, the character o (U+006F: LATIN SMALL LETTER O) followed by the combining accent ¨ (U+0308: COMBINING DIAERESIS) should be treated as identical in meaning to the single character ö (U+00F6: LATIN SMALL LETTER O WITH DIAERESIS). There are two codepoints that could be used for the Greek question mark (;), and normalization converts the less preferred codepoint to the other.

TAN validation rules require all data to be normalized according to the Unicode NFC algorithm (the most common of the four normalization methods). Any text in a TAN file that is not NFC normalized will be marked as invalid. A supplied Schematron Quick Fix will let users automatically normalize text (for editing tools such as Oxygen that support Schematron Quick Fixes). This enforcement of NFC normalization helps to make sure that texts are fairly compared.
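The equivalences described above can be checked with any Unicode-aware language. The sketch below uses Python's standard unicodedata module; the helper is_nfc is a hypothetical name for illustration and is not the TAN validation routine.

```python
import unicodedata

def is_nfc(text):
    """True if the text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text

decomposed = "o\u0308"   # LATIN SMALL LETTER O + COMBINING DIAERESIS
composed = "\u00f6"      # LATIN SMALL LETTER O WITH DIAERESIS

# NFC composes the two-codepoint sequence into the single character.
assert unicodedata.normalize("NFC", decomposed) == composed

# U+037E GREEK QUESTION MARK is converted to the ordinary semicolon.
assert unicodedata.normalize("NFC", "\u037e") == ";"

print(is_nfc(decomposed), is_nfc(composed))
```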

Unicode characters with special interpretation

The characters U+200B ZERO WIDTH SPACE, U+200D ZERO WIDTH JOINER, and U+00AD SOFT HYPHEN placed at the end of a leaf <div>, perhaps followed by space that will be ignored (see below), signal that the text is to be joined with any subsequent text (i.e., the next leaf <div>). Accordingly, any TAN function that needs to extract text from a leaf <div> structure will delete from the end of its text the U+200B, U+200D, or U+00AD character and its trailing space. (By contrast, text from a leaf <div> that does not end this way will first be space-normalized, then a single space will be appended.) Because these special line-end characters are difficult to distinguish visually from spaces and hyphens, their XML entities, &#x200b;, &#x200d;, and &#xad;, should be preferred in any XML output.
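The joining rule can be sketched in a few lines of Python. This is an illustration of the rule as stated here, not the TAN function library: the names leaf_div_text and join_leaf_divs are hypothetical, and the sketch assumes that text ending in a special character is space-normalized internally as well.

```python
import re

# Characters that, at the end of a leaf <div>, ask for the text to be
# joined directly to the next leaf <div>.
JOIN_CHARS = "\u200b\u200d\u00ad"

def leaf_div_text(raw):
    """Return the text of one leaf <div>, ready for concatenation."""
    # Space-normalize using the four XML space characters only.
    text = re.sub(r"[ \t\n\r]+", " ", raw).strip(" ")
    if text and text[-1] in JOIN_CHARS:
        # Drop the special character (its trailing space is already gone).
        return text[:-1]
    # Otherwise the text ends in exactly one space.
    return text + " "

def join_leaf_divs(divs):
    """Concatenate the texts of consecutive leaf <div>s."""
    return "".join(leaf_div_text(raw) for raw in divs)

print(repr(join_leaf_divs(["  Lorem ip\u00ad\n", "sum dolor\n", "sit"])))
```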

Much has been written about the different ways U+00AD SOFT HYPHEN has been or should be used and interpreted. Debate will no doubt continue. TAN design assumes that the soft hyphen marks a place in a word where a line break has occurred, is allowed to occur, or both. In situations where the text is printed or displayed, any soft hyphen that does not mark a word broken by a line should not be displayed.

Combining characters

At the core level of conformance, Unicode does not dictate whether combining characters (accents, modifying symbols) should be counted independently, or as part of a base character, nor do core XML technologies. In most cases, this point is negligible. But it can affect regular expressions and XPath expressions (see below).

Two of the class-2 formats allow the counting of characters. Such counting is assumed to be made exclusively of individual base (non-combining) characters (each perhaps followed by one or more combining characters). Therefore one character is defined as the regular expression \P{M}\p{M}*, bound to the global variable $tan:char-regex (see the section called “$tan:char-regex”). Any numerical reference made in a TAN file to an individual character, i.e., through @chars, is interpreted by counting only non-combining characters. When the nth character is requested, TAN functions will return the nth base character along with any combining characters that immediately follow.

For example, a̳b̈́c͠d consists of four base characters interleaved with three combining characters, technically seven codepoints in total. But for @chars, which counts only base characters, there are a maximum of four characters. A value of 1 picks both the first base character and its combining character, a̳.
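Python's built-in re module does not support \p{M}, so the sketch below approximates the pattern \P{M}\p{M}* by consulting Unicode general categories directly. The function names tan_chars and pick_char are hypothetical, and the three combining marks in the sample string are arbitrary choices for this illustration.

```python
import unicodedata

def tan_chars(text):
    """Split text into base characters, each with its combining characters."""
    clusters = []
    for char in text:
        is_mark = unicodedata.category(char).startswith("M")
        if is_mark and clusters:
            clusters[-1] += char    # attach the mark to the preceding base
        else:
            # A new base character. (A mark with no preceding base also lands
            # here; TAN treats that situation as invalid.)
            clusters.append(char)
    return clusters

def pick_char(text, n):
    """Return the nth character (1-based), counted the way @chars counts."""
    return tan_chars(text)[n - 1]

sample = "a\u0333b\u0344c\u0360d"   # four base letters, three combining marks
print(len(sample), len(tan_chars(sample)))
```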

TAN rules stipulate that combining characters must have a preceding base character. Any <div> that, after any initial space, starts with a combining character will be marked as invalid. See also Regular Expressions and Combining Characters.

Unicode points not allowed

Because TAN files are not scriptum-oriented (see the section called “Domain model”), the following characters will generate an error if found in a TAN file:

U+00A0 NO-BREAK SPACE
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
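A quick way to screen text for the characters listed above is sketched below in Python. The function name find_disallowed is a hypothetical helper, not a TAN function; TAN itself reports these characters through its validation rules.

```python
# U+00A0 plus the contiguous run U+2000 through U+200A.
DISALLOWED = {0x00A0} | set(range(0x2000, 0x200B))

def find_disallowed(text):
    """Return (position, U+ value) pairs for every disallowed space character."""
    return [
        (index, f"U+{ord(char):04X}")
        for index, char in enumerate(text)
        if ord(char) in DISALLOWED
    ]

print(find_disallowed("a\u00a0b\u2009c"))
```

Note that U+200B ZERO WIDTH SPACE falls just outside the range and is therefore not reported.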

eXtensible Markup Language (XML)

What is it?

Defined by the W3C, the eXtensible Markup Language (XML) is a markup language that can be extended to allow anyone to define the structure and rules of a document type. For a quick, simple introduction to XML see Chapter 2, Starting off with the TAN format. XML is one of many formats that can be described as tree-based formats. Others include JSON, HTML, YAML, and Markdown. All of the preceding formats can be expressed in XML, but not the other way around. This does not mean that XML is inherently superior. (For some purposes, it is overkill.) But it does mean that XML is the lingua franca for treelike data structures. For more on the relationship between XML and other treelike formats, especially JSON, see the Invisible Markup Community Group.

Schemas and validation

TAN validation files are found in the schemas subdirectory.

Each TAN file is validated by two types of schema files, one dealing with major rules concerning structure and data type, written in RELAX-NG, the other with more complex, detailed rules, written in Schematron.

The RELAX-NG rules are written primarily in compact syntax (*.rnc), and then converted to XML syntax (*.rng). For TAN-TEI, the special format One Document Does it all (TAN-TEI.odd) is used to adjust the rules for TEI All. The ODD file is then processed by TEI stylesheets into compact and XML RELAX-NG formats.

The Schematron files are generally quite short. The primary work is done by an extensive function library written in XSLT. For the most part, the Schematron files arbitrate between the file and the validation results calculated by the TAN function library. For a detailed overview of this process, see the section called “TAN validation”.

Some validation engines that process a valid TAN-compliant TEI file may return an error such as conflicting ID-types for attribute "who" of element "comment" from namespace "tag:textalign.net,2015:ns". Such a message alerts you to the fact that by mixing TEI and TAN namespaces, you open yourself up to the possibility of conflicting xml:id values. It is your responsibility to ensure that you have not assigned duplicate identifiers. An XML editor may be configured to ignore this discrepancy. (In Oxygen XML editor go to Options > Preferences... > XML > XML Parser > RELAX NG and uncheck the box ID/IDREF.)

Space characters and normalization

By default in XML, unless otherwise specified, consecutive space characters (space, tab, newline, and carriage return) are considered equivalent to a single space. This gives editors the freedom to format XML documents as they like, balancing human readability against compactness. In XML, space normalization is performed by stripping leading and trailing whitespace and replacing each sequence of one or more whitespace characters with a single space, &#x20;.
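The normalization just described is easy to imitate. The Python sketch below is an approximation of XPath's normalize-space(), written for illustration; the function name normalize_space is this sketch's own. It deliberately restricts itself to the four XML space characters, because Python's own notion of whitespace is broader.

```python
import re

XML_SPACE = " \t\n\r"

def normalize_space(text):
    """Strip leading and trailing XML space, then collapse inner runs to one space."""
    return re.sub(r"[ \t\n\r]+", " ", text.strip(XML_SPACE))

print(repr(normalize_space("  a \n\t b\r\n")))

# U+00A0 NO-BREAK SPACE is not an XML space character, so it survives here,
# whereas Python's argument-less str.split() would treat it as whitespace.
print(repr(normalize_space("a\u00a0 b")))
```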

All TAN formats assume space normalization, with an extra caveat for leaf <div>s. Initial space is always stripped. If a leaf <div> ends in the zero width space, the zero width joiner, or the soft hyphen (see the section called “Unicode characters with special interpretation”), that character is suppressed along with any ending space; otherwise the text is normalized to end in a single space character (whether or not there are space characters in the leaf <div> itself).

If retention of multiple spaces or spaces of specific sizes is important for your files and research, then you should not be working with the TAN format, which cannot be used to replicate the appearance of a scriptum (see the section called “Domain model”). Pure TEI (and not TAN-TEI) is a better alternative, since it allows for a literal use of space, and supports the creation of scriptum-oriented XML files. Once you finish with that scriptum-oriented transcription, you might be ready to prepare a second one oriented toward intertextual analysis, at which point TAN would be ideal.

For more on space see guidance in the W3C recommendation.

Mixed, non-mixed, and semi-mixed content

An expanded TAN file (see the section called “TAN validation”) may include what we term a semi-mixed content model, in which any element may have one and only one nonspace text node along with any children elements. That nonspace text node may appear at the beginning or the end of the children nodes. This applies only to the expansion of TAN files, not to TAN files themselves.

Namespaces

What are they?

XML allows users to create document types of whatever kind. One person may wish to use the element <band> to refer to a musical group; another might use this element to encode radio frequencies. Perhaps someone wishes to mention a musical group and a radio frequency in the same document, which would entail mixing two very different types of elements, each named band. XML allows users to mix vocabularies, even when those vocabularies use the same element names. Disambiguation is accomplished by associating an element name with a kind of family name. That family name is an IRI (see the section called “Identifiers and their use (IRIs, URIs, URLs, URNs, UUIDs)” below). The actual full name of an element, then, is the local name plus the IRI that qualifies its meaning, e.g., band{http://music-example.com/terms/} and band{http://frequency-example.com/terms/}.

The IRI—the family name—is called the namespace, a term that might seem vague or confusing. It has nothing to do with space. It is merely a term of art to qualify a name. In the world there are many cities that have the same name. We use the name of the state, region, or even country to explain which city we mean. As region names are to city names, so namespaces are to element (and some attribute) names.

Namespaces can be declared in an XML document. When they appear, they look a lot like attributes. (They aren't.) They take the form xmlns="http://music-example.com/terms/" (this defines the default namespace) or xmlns:[PREFIX]="http://frequency-example.com/terms/" (this assigns a namespace to a prefix) placed inside an opening tag. For example, <band xmlns="http://music-example.com/terms/">...</band> declares http://music-example.com/terms/ to be the default namespace for <band> and all descendants, unless explicitly overridden.

To return to our example, different <band>s can be combined through namespaces:

<band xmlns="http://music-example.com/terms/">
    <band xmlns="http://radio-frequency-example.com/terms/">
        ...
    </band>
</band>

<band xmlns="http://music-example.com/terms/" 
    xmlns:e2="http://radio-frequency-example.com/terms/">
    <e2:band >
        ...
    </e2:band>
</band>

<e1:band xmlns:e1="http://music-example.com/terms/" 
    xmlns:e2="http://radio-frequency-example.com/terms/">
    <e2:band >
        ...
    </e2:band>
</e1:band>

Namespaces allow us to mix elements as we like. But it also means that when you point to, or refer to an element, you should always be aware of what its namespace is.
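One way to see that prefixes are only shorthand is to ask a parser for the full names. The Python sketch below uses the standard ElementTree parser, which reports each element as the namespace in curly braces followed by the local name. The helper expanded_names is hypothetical, and the two sample documents are condensed variants of the listings above.

```python
import xml.etree.ElementTree as ET

default_form = """<band xmlns="http://music-example.com/terms/">
    <band xmlns="http://radio-frequency-example.com/terms/">...</band>
</band>"""

prefixed_form = """<e1:band xmlns:e1="http://music-example.com/terms/"
    xmlns:e2="http://radio-frequency-example.com/terms/">
    <e2:band>...</e2:band>
</e1:band>"""

def expanded_names(xml_text):
    """List the namespace-qualified name of every element, in document order."""
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.iter()]

print(expanded_names(default_form))
print(expanded_names(prefixed_form))
```

Both documents yield the same pair of names, which is why a reference to an element must take its namespace into account and not its prefix.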

TAN namespace and prefix

The TAN namespace is tag:textalign.net,2015:ns. The recommended prefix is tan. The namespace does not change from one version of TAN to another.

The TAN-TEI format uses as its default the TEI namespace, http://www.tei-c.org/ns/1.0, normally given the prefix tei. But in a TAN-TEI file, the head and its descendants are in the TAN namespace.

All TAN functions and core global parameters and variables are set in the TAN namespace.

The Text Encoding Initiative

What is it?

The Text Encoding Initiative (TEI; http://www.tei-c.org/index.xml) is a consortium of scholars and scholarly organizations that maintains the rules and documentation behind a collection of XML formats intended for encoding texts. TEI files have been used widely by libraries, museums, publishers, and individual scholars to prepare and publish texts for online research, teaching, and preservation. In addition to the guidelines themselves, the Consortium provides a variety of resources and training events for learning TEI, information on projects using the TEI, a bibliography of TEI-related publications, and software.

TEI provided the impetus for the creation of TAN, and continues to inspire its development. TEI was designed to be highly customizable, to suit the needs of individuals or communities of practice. One of the TAN formats, TAN-TEI, is one such customization, based as it is on an ODD file that is in the same directory as the rest of the schemas. TAN-TEI schemas are generated on the basis of the official TEI All schema that is available at the time of release.

TAN-TEI files and standard, out-of-the-box TEI All files are not automatically interchangeable. TAN-TEI expects all metadata to be human- and computer-readable, whereas TEI metadata is geared primarily to human readability. TAN-TEI tightly regulates the structure of the text, whereas TEI allows for a variety of structures. In any conversion process to and from TEI and TAN-TEI, some human intervention may be required, and conversion in either direction may entail loss.

For more about the strictures placed upon the TEI All schema see the section called “Transcriptions using the Text Encoding Initiative (<TEI>)”. See also Chapter 4, Common patterns and structures and Chapter 5, Class-1 TAN files, representations of textual objects (scripta).

Data types

Being written purely in XML technologies, TAN uses data types defined in the W3C's official specifications, e.g., strings, booleans, integers. The following data types require some special comments.

Languages

TAN adopts for language identification Best Current Practice (BCP) 47, which standardizes identifiers for languages and scripts. For most users of TAN, this will be a simple two- or three-letter abbreviation, sometimes supplemented with a hyphen and a subtag designating a script or region. For example, eng, eng-GB, and eng-Cyrl-GB refer, respectively, to English (in general), English from the United Kingdom, and English from the United Kingdom written in the Cyrillic script. (In BCP 47 a script subtag precedes a region subtag, and the region subtag for the United Kingdom is GB.) As a general rule, values of this type should begin with a three-letter language code, preferably lowercase. (The two-letter codes cover fewer than two hundred languages; the three-letter codes support thousands of them.)

ISO codes for human languages appear in @xml:lang and <for-lang>. The former states what language the enclosed text is in. The latter is an empty element that simply points to a specific language. For example, <for-lang> in the context of a TAN-mor file indicates which languages the file was written for.

TAN has several global variables and functions useful for working with language codes. See the section called “language”.

Dates and times

For dates and dates + times, TAN adopts the corresponding XML data types, which follow ISO syntax. That syntax begins with years (the largest unit) and ends with days, seconds, or fractions of seconds (the smallest).

The simplest date takes this form: YYYY-MM-DD. If a time is included, it is specified by continuing the string, first with a T (for time) then the form hh:mm:ss.sss(Z|[-+]hh:mm). For example, 2016-09-20T20:38:27.141-04:00 is an ISO date-time for Tuesday, September 20, 2016 at 8:38 p.m., Eastern Time Zone.
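Because the syntax follows ISO conventions, general-purpose libraries can usually read it. Here is a brief Python sketch using the standard datetime module. Be aware that, depending on the Python version, this parser accepts a narrower range of strings than the XML data types do (a trailing Z, for instance, is only accepted in newer releases), so treat it as an illustration and not as a validator.

```python
from datetime import date, datetime, timedelta, timezone

stamp = datetime.fromisoformat("2016-09-20T20:38:27.141-04:00")
day = date.fromisoformat("2016-09-20")

# The offset is part of the value, so the same instant can be restated in UTC.
in_utc = stamp.astimezone(timezone.utc)

print(stamp.isoformat())
print(in_utc.isoformat())
```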

Identifiers and their use (IRIs, URIs, URLs, URNs, UUIDs)

TAN makes extensive use of the following identifiers:

IRI: Internationalized Resource Identifier, a generalization of the URI system, allowing the use of Unicode; defined by RFC 3987
URI: Uniform Resource Identifier, a string of characters used to identify a name or a resource; defined by RFC 3986
URL: Uniform Resource Locator, a URI that identifies a Web resource and the communication protocol for retrieving the resource.
URN: Uniform Resource Name, a term that originally referred to persistent names that used a bare urn: scheme, but is now applied to a variety of systems that have registered with the IANA. URNs are generally best thought of as a subset of URIs.
UUID: Universally Unique Identifier, a computer-generated 128-bit number that may be attached as an identifier to any entity. UUIDs can be built into a URN by prefixing them with urn:uuid:.
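As a small illustration of the last item, Python's standard uuid module can generate a random UUID and express it as a URN. Nothing here is specific to TAN; it simply shows the shape of the identifier.

```python
import uuid

identifier = uuid.uuid4()   # a random (version 4) UUID
urn = identifier.urn        # the same value expressed as a URN

print(urn)

# The URN form can be parsed back into the same 128-bit value.
round_trip = uuid.UUID(urn)
```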

The TAN format makes extensive use of all the above. See also the section called “Tag URNs”.

Resource Description Framework (RDF) and Linked Open Data

What are they?

Identifiers are used in many contexts for many purposes. One such purpose is called Linked Open Data (LOD), also known as the Semantic Web, which aims to allow cross-project interoperability of data. It relies upon a very simple data model called Resource Description Framework (RDF), recommended by the World Wide Web Consortium (W3C). The term "Resource"—the R in RDF—refers to any person, place, concept—anything at all, whether you think of it as a resource or not. "Description" is overly specific, too, since RDF was designed to support general assertions, descriptive or not. Perhaps it is easiest to think of RDF as a standardized way to make assertions, as if the name were simply "Assertion Framework." It is a way to make claims about things in the world.

The RDF data model rests upon the concept of a statement, made of three parts: subject, predicate, and object. Subjects and predicates take identifiers that name things. The object may take an identifier or just data. As people independently identify concepts with the same URLs, they create RDF datasets that can be combined, synthesized, and compared. RDF statements found across the web allow inferences no individual project could ever anticipate.

The Semantic Web recommends the use of URLs as identifiers. That way, if a computer encounters a URL naming a concept, it can be programmed to go to the web resource and retrieve other RDF statements, recursively. So URL identifiers look like a web page address (e.g., http://...), but they are first and foremost names for things. Ideally, those URLs will still name those things after the domain name expires and the web resource cannot be found.

Although RDF statements must be made of only three components, it is possible in a roundabout way to create more complex assertions. In one technique, the assertion itself is given a URL, and then RDF statements are made about the assertion. Such assertions are in some cases not easily integrated with other RDF statements. Users who query an RDF database will not find relevant complex RDF statements unless they build their queries to anticipate such situations (or the query engine has been customized).

TAN claims and RDF

Much of TAN can be converted to RDF statements. In fact, TAN may be one of the most human-friendly ways to read and write RDF. For example, consider how one might express "Person X's name is 'Dave Smith'." Compare this snippet (taken from http://linkeddatabook.com/editions/1.0/), written in Turtle, the RDF syntax generally regarded as the most human-readable, ...

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . 
@prefix foaf: <http://xmlns.com/foaf/0.1/> . 

<http://biglynx.co.uk/people/dave-smith> 
rdf:type foaf:Person ; 
foaf:name "Dave Smith" .

...with the TAN equivalent:

<person>
   <IRI>http://biglynx.co.uk/people/dave-smith</IRI>
   <name>Dave Smith</name>
</person>

These TAN and RDF expressions are interchangeable.
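To make the correspondence concrete, the Python sketch below turns the <person> snippet into subject, predicate, object triples matching the Turtle example. It is only an illustration: the function person_to_triples is hypothetical and not part of the TAN function library, the mapping of <person> to foaf:Person is simply read off the comparison above, and the snippet is parsed without a namespace, exactly as printed.

```python
import xml.etree.ElementTree as ET

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"

snippet = """<person>
   <IRI>http://biglynx.co.uk/people/dave-smith</IRI>
   <name>Dave Smith</name>
</person>"""

def person_to_triples(xml_text):
    """Return (subject, predicate, object) tuples for one <person> element."""
    person = ET.fromstring(xml_text)
    triples = []
    for iri in person.findall("IRI"):
        subject = iri.text.strip()
        triples.append((subject, RDF_TYPE, FOAF + "Person"))
        for name in person.findall("name"):
            triples.append((subject, FOAF + "name", name.text.strip()))
    return triples

for triple in person_to_triples(snippet):
    print(triple)
```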

But in more complex claims, it is, at this time, not clear whether all assertions in TAN can be losslessly converted to the RDF model. Every class-2 file makes a claim about the text, and there must always be attached to the claim someone that must be blamed or credited for the assertion. TAN also permits such claims to be modified through traditional adverbs. This is best seen in the TAN-A <claim>, which allows a person to nuance a claim to a degree that is difficult or impossible to express in traditional RDF. For example, RDF does not allow one to say "Person X is not the author of text Y," but TAN does.

TAN claims can also be quite complex. Whereas the standard RDF claim consists of three components—subject, predicate, object—most TAN claims have more. Every TAN claim must have at the minimum: a claimant (no RDF counterpart; the person, organization, or algorithm that asserts the claim), a subject (counterpart to RDF subject), and a verb (counterpart to RDF predicate). Verbs can be defined to permit, require, or disallow other claim components, such as adverbs or objects, many of which are permitted by default. Most TAN claims involve more than three components, so converting a TAN claim to RDF requires creating a complex RDF statement. In many cases, this requires the use of RDF* instead of RDF (link below).

Many TAN claims involve textual subjects or objects. References to parts of text can be quite complex, and they must be made with reference to other entities. It is doubtful whether a given specific textual subject or object can be satisfactorily reduced to an unambiguous IRI, because such an IRI would need to include a mechanism to resolve the meaning of the syntax. Such an IRI must not only explain the work's reference system, but also identify the chosen version, scriptum, and perhaps token definition and numeration system. Many texts have more than one "canonical" reference system, so an IRI might point to two different textual passages, thereby breaking a cardinal rule of IRIs: although an entity may be given multiple IRIs, it is never acceptable for an IRI to be ambiguous. There is, at present, no widely accepted solution to this problem, although attempts have been made through CTS URNs and DTS URNs.

For more details see the section called “General annotations and alignments (<TAN-A>)” and <claim>.

Tag URNs

TAN files make extensive use of tag URNs (see the section called “Identifiers and their use (IRIs, URIs, URLs, URNs, UUIDs)”). In fact, TAN's namespace is itself a tag URN (the section called “Namespaces”). A tag URN has two parts:

Namespace. tag: + an e-mail address or domain name owned by the person or organization that has authorized the creation of the TAN file + , + an arbitrary day on which that address or domain name was owned + :. The day is expressed in the form YYYY-MM-DD, YYYY-MM, or YYYY. A missing MM or DD is implicitly assigned the value of 01.
Name of the subject. An arbitrary string (unique to the namespace chosen) chosen by the namespace owner as a label for the subject (e.g., the file, a work, a scriptum). If you are providing a tag URN for a TAN file, that name can be the same as the filename, but it is good practice not to do so, because filenames often need to be changed. You should pick a name that is at least somewhat intelligible to human readers. It is a good idea to build a name via categories, from most general to most specific. For example, tag:pat@example.com,2014:work:aristotle-pseudo:secreta-secretorum might be used as an IRI to name the work the Secret of Secrets attributed to Aristotle. A TAN file that transcribes a particular version of this text might be named like this: tag:pat@example.com,2014:transcription:scriptum:badawi-1954:work:secrets.

Although you may use any tag URN coined by someone else, when you create a tag URN, you may use only namespaces you own or owned.

Care should be taken in choosing the name, because you are the sole guarantor of its uniqueness. It is permissible for something to have multiple identifiers, but never acceptable for an identifier to name more than one thing. It is a good practice to keep a master checklist of tag URNs you have created. If you find yourself forgetting, or think you run the risk of creating duplicate tag URNs, you should start afresh by creating a new namespace for your tag URNs, if only by changing the date in the tag URN namespace.

Example 3.1. Tag URNs

tag:jan@example.com,1999-01-31:TAN-T001
tag:example.com,2001-04:work:usc22.1
tag:evagriusponticus.net,2014:tan-a-lm:Evagrius_Praktikos_grc_Guillaumonts
tag:bbrb@example.org,1995-04-01:pos-grc

The first example comes from someone who owned the email address jan@example.com on January 31, 1999 (at the stroke of midnight, Coordinated Universal Time). The other examples follow a similar logic. The namespaces of the second and third examples are tied to the owners of specific domain names. The 2014 in the third example is shorthand for the first second of January 1, 2014.
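The two-part structure lends itself to mechanical parsing. The Python sketch below splits a tag URN into its owner, date, and name, filling in a missing month or day with 01 as described above. The regular expression is a simplification written for this illustration (it does not check that the owner is a well-formed email address or domain name), and parse_tag_urn is a hypothetical helper, not a TAN function.

```python
import re
from datetime import date

TAG_URN = re.compile(
    r"^tag:(?P<owner>[^,:]+),"                                      # email or domain
    r"(?P<year>\d{4})(?:-(?P<month>\d{2})(?:-(?P<day>\d{2}))?)?"   # YYYY[-MM[-DD]]
    r":(?P<name>.+)$"                                               # name of the subject
)

def parse_tag_urn(urn):
    """Return (owner, date of ownership, name) for a tag URN."""
    match = TAG_URN.match(urn)
    if match is None:
        raise ValueError(f"not a tag URN: {urn!r}")
    # A missing month or day is implicitly 01.
    owned_on = date(
        int(match["year"]),
        int(match["month"] or 1),
        int(match["day"] or 1),
    )
    return match["owner"], owned_on, match["name"]

print(parse_tag_urn("tag:example.com,2001-04:work:usc22.1"))
```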

TAN files are identified and named via tag URNs, not URLs, for several reasons:

Permanence. Authors of TAN data are creating files that are meant to be relevant for decades and centuries from now, well after most domain names today have changed ownership or fallen into obsolescence, and well after the creators are dead. URLs are not designed for such longevity.
Responsibility. The TAN format requires every piece of data to be attributable to someone (a person, a group of persons, or an algorithm). A tag URN connects the identifier with the responsible person or group. URLs cannot identify the person or organization responsible for the name.
Accessibility. Tag URNs have almost no barriers. They can be created by anyone who has an email address. No one has to register with a central authority. You can begin naming anything you want, any time you want, without anyone's approval, and without paying anything.
Ease. Tag URNs are easy to use. All you need is an email address, which is very easy to get. You can use a domain name too, but many potential TAN authors never have owned a domain name, and never will, barring them from creating or publishing linked open data under the classic model, where you coin URLs in a domain you own. Many of those who do own domain names cannot or do not wish to configure, populate, maintain, and troubleshoot servers with the referral mechanisms recommended by Semantic Web advocates (see the section called “Resource Description Framework (RDF) and Linked Open Data”).
Scholarly citation norms. In the Semantic Web, the conflation of URL qua name with URL qua location is considered by many a virtue because the single string does double duty, both naming the resource and pointing to a location where more can be learned. Although the combination is elegant from the perspective of an engineer, it is confusing to many others. URLs are commonly thought to be merely locations for data, not names for things. It also goes against an important principle in scholarly citation practices, namely, the name of a publication should always be distinguished from where it might be found.

Regular expressions

Regular expressions are patterns for searching text. The term regular here does not mean ordinary. Rather, alluding to the Latin root regula (rule), it refers to a rule-based method of finding and replacing text through patterns. Regular expressions come in different flavors, and have several layers of complexity. TAN regular expressions adhere closely to the recommendation of XSLT 3.0 (XML Schema Datatypes plus some extensions), as outlined in XPath Functions 3.1.

Caution

XML Schema Datatypes defines regular expressions differently than does Perl, whose flavor of regular expressions is one of the most common. For example, the pipe symbol, |, is treated as a word character in XML regular expressions (\w), but the opposite is true for Perl. For convenience, here is how the visible codepoints in the range U+0020..U+00FF are classified according to XML (and therefore TAN):

Word characters (\w): $ + 0 1 2 3 4 5 6 7 8 9 < = > A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ^ ` a b c d e f g h i j k l m n o p q r s t u v w x y z | ~ ¢ £ ¤ ¥ ¦ ¨ © ª ¬ ® ¯ ° ± ² ³ ´ µ ¸ ¹ º ¼ ½ ¾ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Non-word characters (\W): ! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { } ¡ § « ¶ · » ¿

The placement of some of these characters may seem to you counterintuitive or wrong. But at this point complaining will not change the conventions. Any apparent mistakes are definitive ones. Just familiarize yourself with the conventions.
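The XML definition of a word character is any character that is not punctuation, a separator, or a control or other character in the Unicode general categories P, Z, and C. The Python sketch below applies that definition through the standard unicodedata module and contrasts it with Python's own \w, which is closer to Perl's. The helper names is_xml_word_char and xml_word_runs are this sketch's own and are not TAN functions.

```python
import re
import unicodedata
from itertools import groupby

def is_xml_word_char(char):
    """True if char counts as a word character under the XML definition."""
    # Everything except general categories P (punctuation),
    # Z (separators), and C (control, format, and so on).
    return unicodedata.category(char)[0] not in "PZC"

def xml_word_runs(text):
    """Return the maximal runs of XML word characters in text."""
    return [
        "".join(run)
        for is_word, run in groupby(text, key=is_xml_word_char)
        if is_word
    ]

# The pipe and the underscore are classified in opposite ways.
for char in "|_":
    print(char, is_xml_word_char(char), re.fullmatch(r"\w", char) is not None)

print(xml_word_runs("Wi-fi, good. A_hem* isn't!"))
```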

A regular expression search pattern is treated just like a normal search pattern until the computer reaches a special character: . [ ] \ | ^ $ ? * + { } ( ). Here is a brief key to how those special characters behave in regular expressions when they are first found. (Some of these special characters change their meaning if they are found inside square brackets; on this point, see the recommended reading below):

Table 3.2. Special characters in regular expressions

Symbol	Meaning
`.`	any character
`|`	or (union)
`^`	start of a line or string (doesn't capture any characters)
`?`	zero or one
`*`	zero or more
`+`	one or more
`[ ]`	a class of characters
`( )`	a group
`$`	end of a line or string (doesn't capture any characters)

If you need to use any of those special characters as characters in their own right, you must escape them by prefixing each one with the escape character, the backslash (\).

Table 3.3. Special characters in regular expressions

Symbol	Meaning
`\\`	backslash (an escaped escape character)
`\^`	a caret sign (must be escaped with the \)
`\$`	dollar sign (escaped)
`\(`	opening parenthesis (escaped)
`\[`	opening square bracket (escaped)
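
Escaping by hand is error-prone when the text to be searched for comes from elsewhere. As a minimal sketch, the hypothetical helper `escape_special` below prefixes each of the special characters listed earlier with a backslash. It assumes the result will be used outside square brackets, where those characters are special.

```python
# The special characters named in this section: . [ ] \ | ^ $ ? * + { } ( )
SPECIAL_CHARACTERS = set(".[]\\|^$?*+{}()")


def escape_special(text: str) -> str:
    """Prefix every special character in text with the escape character."""
    return "".join(
        "\\" + ch if ch in SPECIAL_CHARACTERS else ch
        for ch in text
    )


print(escape_special("A_hem* (1+1)"))  # prints: A_hem\* \(1\+1\)
```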

The escape character appearing before some letters accesses certain classes of characters:

Table 3.4. Special characters in regular expressions

Symbol	Meaning
`\w`	any word character
`\W`	any nonword character
`\s`	any of the four standard spacing characters: space (U+0020), tab (U+0009), newline (U+000A), carriage return (U+000D)
`\S`	anything not a spacing character
`\d`	any decimal digit (0-9, as well as the decimal digits of other scripts, i.e., Unicode category Nd)
`\D`	anything not a digit
`\p{IsGujarati}`	any character from the Unicode block named Gujarati
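
Not every regular expression engine supports block escapes such as `\p{IsGujarati}`. Python's standard `re` module, for instance, does not. A rough equivalent, sketched below, spells out the block's codepoint range by hand; the range U+0A80..U+0AFF is that of the Gujarati block, and the sample string and the name `GUJARATI_BLOCK` are made up for illustration.

```python
import re

# Assumption: the explicit range U+0A80..U+0AFF stands in for \p{IsGujarati},
# which Python's re module does not recognize.
GUJARATI_BLOCK = re.compile("[\u0A80-\u0AFF]+")

# A made-up sample: Latin text, the word "Gujarati" in Gujarati script, digits.
sample = "TAN \u0a97\u0ac1\u0a9c\u0ab0\u0abe\u0aa4\u0ac0 2024"

# Only the run of Gujarati-block characters is returned.
print(GUJARATI_BLOCK.findall(sample))
```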

Some examples of regular expressions:

Table 3.5. Examples of Regular Expressions

Expression	Meaning	What the expression matches when applied to "Wi-fi, good. A_hem* isn't!"
`^.+$`	one whole line of characters	"Wi-fi, good. A_hem* isn't!"
`[ae]`	a or e	"e"
`[a-e]`	a, b, c, d, or e	"d", "e"
`[^ae]+`	one or more characters that are anything except a or e	"Wi-fi, good. A_h", "m* isn't!"
`.i`	any character followed by i	"Wi", "fi", " i"
`(.i)`	any character followed by i, treated as a capture group (used only in a search pattern)	"Wi", "fi", " i"
`[aeiou]\w*`	any lowercase vowel along with every word character that follows	"i", "i", "ood", "em", "isn"
`[t*].`	any t or * and the following character	"* ", "t!" (Note that the asterisk, if inside a character class, represents itself.)
`\s+`	one or more space characters	" ", " ", " "
`\w+`	one or more word characters	"Wi", "fi", "good", "A", "hem", "isn", "t" (the underscore is a nonword character in XML regular expressions, so it splits "A_hem")
`\W+`	one or more nonword characters	"-", ", ", ". ", "_", "* ", "'", "!"
`[^q]+`	one or more characters that are not a q	"Wi-fi, good. A_hem* isn't!"

The examples above provide a taste of how regular expressions are constructed and read.
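
Several rows of the table can be checked mechanically. The sketch below uses Python's `re` module, whose syntax agrees with XML regular expressions for these particular patterns. Rows that rely on `\w` or `\W` are deliberately left out, because Python follows the Perl definition of a word character. The names `EXPECTED` and `check_rows` are hypothetical.

```python
import re

SAMPLE = "Wi-fi, good. A_hem* isn't!"

# Pattern -> the matches listed in the table for SAMPLE.
EXPECTED = {
    r"^.+$": [SAMPLE],
    r"[ae]": ["e"],
    r"[a-e]": ["d", "e"],
    r"[^ae]+": ["Wi-fi, good. A_h", "m* isn't!"],
    r".i": ["Wi", "fi", " i"],
    r"(.i)": ["Wi", "fi", " i"],
    r"[t*].": ["* ", "t!"],
    r"\s+": [" ", " ", " "],
    r"[^q]+": [SAMPLE],
}


def check_rows(sample: str, expected: dict) -> dict:
    """Map each pattern to True if its matches in sample equal the listed ones."""
    return {
        pattern: re.findall(pattern, sample) == matches
        for pattern, matches in expected.items()
    }


results = check_rows(SAMPLE, EXPECTED)
for pattern, agrees in results.items():
    print(f"{pattern:10} {'agrees' if agrees else 'differs'}")
```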

Regular Expressions and Combining Characters

A regular expression might be ambiguous in the context of combining characters. Suppose we have a string of three characters, áb (i.e., an acute accent over the a; the codepoints are, in XML entities, &#x61;&#x301;&#x62;). The regular expression a. will in some search engines include the b, and in others not.

Unicode has differentiated three levels of support for regular expressions (see official report). Only level-one conformance is guaranteed in XPath, and therefore in TAN. Combining characters fall in level two. In TAN, character counts depend exclusively upon base characters, not combining ones (see the section called “Combining characters”).
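
The ambiguity is easy to reproduce. Python's `re` module works codepoint by codepoint, so it illustrates the level-one behavior described above. The sketch below also shows a count of base characters; treating every character in a Unicode mark category (M) as a combining character is an assumption made here for illustration, and `count_base_characters` is a hypothetical helper, not a TAN function.

```python
import re
import unicodedata

# a + COMBINING ACUTE ACCENT (U+0301) + b: three codepoints, two visible letters
decomposed = "a\u0301b"

# Codepoint by codepoint, the dot consumes the accent, so b is not included.
matches_decomposed = re.findall("a.", decomposed)


def count_base_characters(text: str) -> int:
    """Count characters that are not in a Unicode mark category (M*)."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


# After NFC normalization, a and the accent fuse into U+00E1, so the same
# pattern no longer finds a plain a at all.
composed = unicodedata.normalize("NFC", decomposed)
matches_composed = re.findall("a.", composed)

print(len(decomposed), count_base_characters(decomposed))
print([hex(ord(ch)) for ch in matches_decomposed[0]])
print(matches_composed)
```

The same visible text therefore yields different matches depending on its normalization form, which is one reason to normalize before searching.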

TAN includes several functions that usefully extend XML regular expressions. See the section called “regular expressions”.

