TAN depends upon a set of relatively stable technologies. Those technologies and the underlying terminology are briefly explained below, with attention paid to interpretive decisions that affect validation rules.
Unicode is the worldwide standard for the encoding, representation, and exchange of digital texts. The standard is maintained by a nonprofit consortium whose goal is to represent all the world's writing systems, living and historical. The Unicode standard allows us to share texts in any alphabet, syllabary, or ideographic system reliably, regardless of how that text is rendered (e.g., fonts, display).
With more than 128,000 characters, Unicode is almost as complex as human writing itself. The entire sequence of characters is divided into blocks, each one reserved, more or less, for a particular script or group of characters. Within each block, characters may be grouped further. Each character is assigned a single number called a codepoint.
Codepoints are numbered according to the hexadecimal system (base 16), which uses the digits 0 through 9 and the letters A through F. (The decimal number 10 is hexadecimal A; decimal 11 = hex B; decimal 16 = hex 10; decimal 79 = hex 4F.) It is helpful to think of Unicode as a very long table of sixteen columns, a glyph in each square; this is illustrated nicely in this article.
It is common to refer to Unicode characters by their value and perhaps by their name. The value customarily starts "U+" and continues with the hexadecimal value, usually at least four hexadecimal characters. When the official Unicode name is given, it is normally in uppercase. Examples:
Table 3.1. Unicode characters
Character | Unicode value | Unicode name |
---|---|---|
" " (space) | U+0020 | SPACE |
® | U+00AE | REGISTERED SIGN |
ю | U+044E | CYRILLIC SMALL LETTER YU |
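As a rough illustration of these conventions, the following Python sketch uses the standard unicodedata module to print the U+ value and official name of the three characters in Table 3.1. The helper name describe is invented for this example and is not part of TAN.

```python
import unicodedata


def describe(char: str) -> str:
    """Return the customary 'U+XXXX NAME' label for a single character."""
    # ord() gives the codepoint as an integer; 04X pads it to four hex digits.
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


for ch in (" ", "\u00ae", "\u044e"):
    print(describe(ch))

# Hexadecimal and decimal are two spellings of the same number.
print(int("4F", 16), hex(16))
```

The same integer can be written in either base, which is why a codepoint can be cited in hexadecimal or in decimal without ambiguity.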
In an XML file, nearly any Unicode codepoint may be used, either by typing or pasting the character directly, or by using XML entities. An XML entity is a proxy for some other text, marked by an ampersand, some text, and then a semicolon. For example, &amp; represents the ampersand and &lt; stands for <. To access a specific Unicode character, an entity may start with &#x followed by the hexadecimal codepoint (if you prefer to work with decimal codepoints, leave off the x). For example, the XML hex entity &#x44e; (or &#1102; in decimal) is a proxy for the Cyrillic small letter yu.
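A minimal sketch of how an XML parser resolves such entities, using Python's standard xml.etree.ElementTree module. The element name p is arbitrary and chosen only for the example.

```python
import xml.etree.ElementTree as ET

# Hex and decimal character references for the same codepoint,
# followed by two of XML's predefined entities.
doc = "<p>&#x44e; &#1102; &amp; &lt;</p>"

text = ET.fromstring(doc).text
print(text)
print([f"U+{ord(c):04X}" for c in text if c != " "])
```

After parsing, the two numeric references are indistinguishable: both yield the single character U+044E.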
Unicode rules provide guidance on how text should be normalized, to identify equivalent variations. For example, the character o (U+006F: LATIN SMALL LETTER O) followed by the combining accent ¨ (U+0308: COMBINING DIAERESIS) should be treated as identical in meaning to the single character ö (U+00F6: LATIN SMALL LETTER O WITH DIAERESIS). There are two codepoints that could be used for the Greek question mark (;), and normalization converts the less preferred codepoint to the other.
TAN validation rules require all data to be normalized according to the Unicode NFC algorithm (the most common of the four normalization methods). Any text in a TAN file that is not NFC normalized will be marked as invalid. A supplied Schematron Quick Fix will let users automatically normalize text (for editing tools such as Oxygen that support Schematron Quick Fixes). This enforcement of NFC normalization helps to make sure that texts are fairly compared.
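The effect of NFC normalization can be tried out with Python's standard unicodedata module. This is only a sketch of the Unicode algorithm itself, not of TAN's Schematron rule or its Quick Fix; the function name is_nfc is an assumption made for the example.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """True if the text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


decomposed = "o\u0308"          # o + COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed), is_nfc(decomposed), is_nfc(composed))

# U+037E GREEK QUESTION MARK normalizes to U+003B SEMICOLON.
greek_question = "\u037e"
print(f"U+{ord(unicodedata.normalize('NFC', greek_question)):04X}")
```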
The characters U+200B ZERO WIDTH SPACE, U+200D ZERO WIDTH JOINER, and U+00AD SOFT HYPHEN placed at the end of a leaf <div>, perhaps followed by space that will be ignored (see below), signal that the text is to be joined with any subsequent text (i.e., the next leaf <div>). Accordingly, any TAN function that needs to extract text from a leaf <div> structure will delete from the end of its text the U+200B, U+200D, or U+00AD character and its trailing space. (By contrast, text from a leaf <div> that does not end this way will first be space-normalized, then a single space will be appended.) Because these special line-end characters are difficult to distinguish visually from spaces and hyphens, their XML entities, &#x200b;, &#x200d;, and &#xad;, should be preferred in any XML output.
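The rule for extracting text from a leaf <div> can be sketched in Python as follows. This is an illustration of the behavior described above, not the TAN function library's implementation; the name leaf_div_text is hypothetical, and the sketch assumes that internal runs of space are normalized in both cases.

```python
import re

JOINERS = "\u200b\u200d\u00ad"   # zero width space, zero width joiner, soft hyphen
XML_SPACE = " \t\n\r"            # the four XML space characters


def leaf_div_text(raw: str) -> str:
    """Return the text of a leaf div, ready to be concatenated with the next."""
    # Initial space is always stripped; trailing space is ignored.
    text = re.sub(r"[ \t\n\r]+", " ", raw.strip(XML_SPACE))
    if text and text[-1] in JOINERS:
        # The special character is dropped and nothing is appended,
        # so the next leaf div continues the same word.
        return text[:-1]
    # Otherwise the text ends in exactly one space.
    return text + " "


print(repr(leaf_div_text("  inter\u00ad \n") + leaf_div_text("national\n")))
print(repr(leaf_div_text(" two   words\n")))
```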
Much has been written about the different ways U+00AD SOFT HYPHEN has been or should be used and interpreted. Debate will no doubt continue. TAN design assumes that the soft hyphen marks a place in a word where a line break has occurred, is allowed to occur, or both. In situations where the text is printed or displayed, any soft hyphen that does not mark a word broken by a line should not be displayed.
At the core level of conformance, Unicode does not dictate whether combining characters (accents, modifying symbols) should be counted independently, or as part of a base character, nor do core XML technologies. In most cases, this point is negligible. But it can affect regular expressions and XPath expressions (see below).
Two of the class-2 formats allow the counting of characters. Such counting is assumed to be made exclusively of individual base (non-combining) characters (each perhaps followed by one or more combining characters). Therefore one character is defined as the regular expression \P{M}\p{M}*, bound to the global variable $tan:char-regex. Any numerical reference made in a TAN file to an individual character, i.e., through @chars, is interpreted by counting only non-combining characters. When the nth character is requested, TAN functions will return the nth base character along with any combining characters that immediately follow.
For example, a̳b̈́c͠d consists of four base characters, interleaved with three combining characters, technically seven codepoints in total. But for @chars, which counts only base characters, there are a maximum of four characters. A value of 1 picks both the first base character and its combining character, a̳.
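Python's built-in re module does not support \p{M}, so the sketch below approximates the pattern \P{M}\p{M}* by checking each codepoint's Unicode general category instead. The function name tan_chars and the exact codepoints chosen for the sample string are assumptions made for illustration.

```python
import unicodedata


def tan_chars(text: str) -> list[str]:
    """Split text into base characters, each with its trailing combining marks."""
    clusters: list[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            # Note: a leading mark with no base becomes its own item here,
            # whereas the regular expression would simply not match it.
            clusters.append(ch)
    return clusters


sample = "a\u0333b\u0344c\u0360d"   # four bases, three combining marks
chars = tan_chars(sample)
print(len(sample), len(chars))
print(chars[0] == "a\u0333")        # character 1 includes its combining mark
```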
TAN rules stipulate that combining characters must have a preceding base character. Any <div> that, after any initial space, starts with a combining character will be marked as invalid. See also Regular Expressions and Combining Characters.
Because TAN files are not scriptum-oriented (see the section called “Domain model”), the following characters will generate an error if found in a TAN file:
U+00A0 NO-BREAK SPACE
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
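A check for the characters listed above might look like the following Python sketch. The names FORBIDDEN_SPACES and find_forbidden_spaces are invented for this example; TAN's own validation is performed by its schemas and function library.

```python
# U+00A0 plus the contiguous range U+2000 through U+200A.
FORBIDDEN_SPACES = {"\u00a0"} | {chr(cp) for cp in range(0x2000, 0x200B)}


def find_forbidden_spaces(text: str) -> list[tuple[int, str]]:
    """Return (position, codepoint label) for each disallowed space character."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if ch in FORBIDDEN_SPACES
    ]


print(find_forbidden_spaces("a\u00a0b\u2009c"))
print(find_forbidden_spaces("a b"))
```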
Defined by the W3C, the eXtensible Markup Language (XML) is a markup language that can be extended to allow anyone to define the structure and rules of a document type. For a quick, simple introduction to XML see Chapter 2, Starting off with the TAN format. XML is one of many formats that can be described as tree-based formats. Others include JSON, HTML, YAML, and Markdown. All of the preceding formats can be expressed in XML, but not the other way around. This does not mean that XML is inherently superior. (For some purposes, it is overkill.) But it does mean that XML is the lingua franca for treelike data structures. For more on the relationship between XML and other treelike formats, especially JSON, see the Invisible Markup Community Group.
TAN validation files are found in the schemas
subdirectory.
Each TAN file is validated by two types of schema files, one dealing with major rules concerning structure and data type, written in RELAX-NG, the other with more complex, detailed rules, written in Schematron.
The RELAX-NG rules are written primarily in compact syntax
(*.rnc
), and then converted to XML syntax (*.rng
).
For TAN-TEI, the special format One Document Does it all
(TAN-TEI.odd
) is used to adjust the rules for TEI All. The ODD
file is then processed by TEI stylesheets into compact and XML RELAX-NG
formats.
The Schematron files are generally quite short. The primary work is done by an extensive function library written in XSLT. For the most part, the Schematron files arbitrate between the file and the validation results calculated by the TAN function library. For a detailed overview of this process, see the section called “TAN validation”.
Some validation engines that process a valid TAN-compliant TEI file may
return an error such as conflicting ID-types for attribute "who" of
element "comment" from namespace "tag:textalign.net,2015:ns"
. Such a
message alerts you to the fact that by mixing TEI and TAN namespaces, you open
yourself up to the possibility of conflicting xml:id
values. It is
your responsibility to ensure that you have not assigned duplicate identifiers.
An XML editor may be configured to ignore this discrepancy. (In Oxygen XML
editor go to Options > Preferences... > XML > XML Parser > RELAX NG and uncheck
the box ID/IDREF.)
By default in XML, unless otherwise specified, consecutive space characters (space, tab, newline, and carriage return) are considered equivalent to a single space. This gives editors the freedom to format XML documents as they like, balancing human readability against compactness. In XML, space normalization is performed by stripping leading and trailing whitespace and replacing sequences of one or more whitespace characters with a single space, &#x20;.
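XML space normalization, as just described, can be imitated in a few lines of Python. The sketch restricts itself to the four XML space characters on purpose; Python's own str.split() would also treat other Unicode spaces as whitespace. The function name normalize_space merely echoes the XPath function of that name.

```python
import re


def normalize_space(text: str) -> str:
    """Collapse runs of XML space characters and trim both ends."""
    collapsed = re.sub(r"[ \t\n\r]+", " ", text)
    return collapsed.strip(" ")


print(repr(normalize_space("  a \t\n b\r\n")))
# A no-break space is not an XML space character, so it is left alone.
print(repr(normalize_space("a\u00a0 b")))
```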
All TAN formats assume space normalization, with an extra caveat for leaf
<div>
s. Initial space
is always stripped. If a leaf <div>
ends in the soft hyphen or the zero width joiner
(see the section called “Unicode characters with special interpretation”) the
character is suppressed along with any ending space, otherwise the text is
normalized to end in a single space character (whether or not there are space
characters in the leaf <div>
itself).
If retention of multiple spaces or spaces of specific sizes is important for your files and research, then you should not be working with the TAN format, which cannot be used to replicate the appearance of a scriptum (see the section called “Domain model”). Pure TEI (and not TAN-TEI) is a better alternative, since it allows for a literal use of space, and supports the creation of scriptum-oriented XML files. Once you finish with that scriptum-oriented transcription, you might be ready to prepare a second one oriented toward intertextual analysis, at which point TAN would be ideal.
For more on space see guidance in the W3C recommendation.
In many popular XML formats such as TEI, XHTML, and DocBook, some elements allow a mixture of elements and nonspace text as children, e.g., <div>Some <span>text</span></div>. These are called mixed content models. The TAN formats, aside from TAN-TEI, are committed to a non-mixed content model, e.g., <div><span>Some </span><span>text</span></div>. Nonspace text nodes and elements are never siblings. The practical effect of this decision is that TAN files may be indented as you like, and whitespace text may be placed anywhere, without altering the meaning. The exception is TAN-TEI files, which allow any kind of TEI construction, including mixed content. Many projects do not consider the implications of how they render space, however, and you should study the topic closely.
An expanded TAN file (see the section called “TAN validation”) may include what we term a semi-mixed content model, in which any element may have one and only one nonspace text node along with any children elements. That nonspace text node may appear at the beginning or the end of the children nodes. This applies only to the expansion of TAN files, not to TAN files themselves.
XML allows users to create document types of whatever kind. One person may
wish to use the element <band>
to refer to a musical group;
another might use this element to encode radio frequencies. Perhaps someone
wishes to mention a musical group and a radio frequency in the same document,
which would entail mixing two very different types of elements, each named
band
. XML allows users to mix vocabularies, even when those
vocabularies use the same element names. Disambiguation is accomplished by
associating an element name with a kind of family name. That family name is an
IRI (see the section called “Identifiers and their use (IRIs, URIs, URLs, URNs, UUIDs)” below). The actual full name of
an element, then, is the local name plus the IRI that qualifies its meaning,
e.g., band{http://music-example.com/terms/}
and
band{http://frequency-example.com/terms/}
.
The IRI—the family name—is called the namespace, a term that might seem vague or confusing. It has nothing to do with space. It is merely a term of art to qualify a name. In the world there are many cities that have the same name. We use the name of the state, region, or even country to explain which city we mean. As region names are to city names, so namespaces are to element (and some attribute) names.
Namespaces can be declared in an XML document. When they appear, they look a
lot like attributes. (They aren't.) They take the form
xmlns="http://music-example.com/terms/"
(this defines the
default namespace) or
xmlns:[PREFIX]="http://frequency-example.com/terms/"
(this
assigns a namespace to a prefix) placed inside an opening tag. For example,
<band xmlns="http://music-example.com/terms/">...</band>
declares http://music-example.com/terms/
to be the default
namespace for <band>
and all descendants, unless explicitly
overridden.
To return to our example, different <band>
s can be combined
through namespaces:
<band xmlns="http://music-example.com/terms/">
   <band xmlns="http://radio-frequency-example.com/terms/"> ... </band>
</band>

<band xmlns="http://music-example.com/terms/"
   xmlns:e2="http://radio-frequency-example.com/terms/">
   <e2:band> ... </e2:band>
</band>

<e1:band xmlns:e1="http://music-example.com/terms/"
   xmlns:e2="http://radio-frequency-example2.com/terms/">
   <e2:band> ... </e2:band>
</e1:band>
Namespaces allow us to mix elements as we like. But this also means that whenever you point or refer to an element, you should be aware of its namespace.
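To see how a parser keeps track of the full name of each element, consider this small Python sketch using the standard xml.etree.ElementTree module. ElementTree reports names as {IRI}local-name, which is simply its own notation for the pairing of namespace and local name.

```python
import xml.etree.ElementTree as ET

doc = (
    '<band xmlns="http://music-example.com/terms/" '
    'xmlns:e2="http://radio-frequency-example.com/terms/">'
    "<e2:band/>"
    "</band>"
)

root = ET.fromstring(doc)
names = [element.tag for element in root.iter()]
for name in names:
    print(name)
```

Both elements have the local name band, but their full names differ because each is qualified by a different namespace; the prefix e2 itself plays no part in the name.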
The TAN namespace is tag:textalign.net,2015:ns
. The recommended
prefix is tan
. The namespace does
not change from one version of TAN to another.
The TAN-TEI format uses as its default the TEI namespace, http://www.tei-c.org/ns/1.0
, normally given the
prefix tei
. But in a TAN-TEI
file, the head
and its descendants are in the TAN
namespace.
All TAN functions and core global parameters and variables are set in the TAN namespace.
The Text Encoding Initiative (TEI; http://www.tei-c.org/index.xml) is a consortium of scholars and scholarly organizations that maintains the rules and documentation behind a collection of XML formats intended for encoding texts. TEI files have been used widely by libraries, museums, publishers, and individual scholars to prepare and publish texts for online research, teaching, and preservation. In addition to the guidelines themselves, the Consortium provides a variety of resources and training events for learning TEI, information on projects using the TEI, a bibliography of TEI-related publications, and software.
TEI provided the impetus for the creation of TAN, and continues to inspire its development. TEI was designed to be highly customizable, to suit the needs of individuals or communities of practice. One of the TAN formats, TAN-TEI, is one such customization, based as it is on an ODD file that is in the same directory as the rest of the schemas. TAN-TEI schemas are generated on the basis of the official TEI All schema that is available at the time of release.
TAN-TEI files and standard, out-of-the-box TEI All files are not automatically interchangeable. TAN-TEI expects all metadata to be human- and computer-readable, whereas TEI metadata is geared primarily to human readability. TAN-TEI tightly regulates the structure of the text, whereas TEI allows for a variety of structures. In any conversion process to and from TEI and TAN-TEI, some human intervention may be required, and conversion in either direction may entail loss.
For more about the strictures placed upon the TEI All schema see the section called “Transcriptions using the Text Encoding Initiative (<TEI>)”. See also Chapter 4, Common patterns and structures and Chapter 5, Class-1 TAN files, representations of textual objects (scripta).
Being written purely in XML technologies, TAN uses data types defined in the W3C's official specifications, e.g., strings, booleans, integers. The following data types require some special comments.
TAN adopts for language identification Best Current Practice (BCP) 47, which standardizes identifiers for languages and scripts. For most users of TAN, this will be a simple two- or three-letter abbreviation, sometimes supplemented with a hyphen and a subtag designating a script or region. For example, eng, eng-GB, and eng-Cyrl-GB refer, respectively, to English (in general), English from the United Kingdom, and English from the United Kingdom written in the Cyrillic script. (BCP 47 places the script subtag before the region subtag; GB is the region code for the United Kingdom.) As a general rule, values of this type should begin with a three-letter language code, preferably lowercase. (The two-letter codes cover fewer than two hundred languages; the three-letter codes support thousands of them.)
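The anatomy of a simple language tag can be sketched with a regular expression in Python. This is a deliberate simplification: it covers only a language code optionally followed by a script and a region, expects the conventional letter case, and ignores the many other subtags that BCP 47 allows. The names LANG_TAG and split_lang_tag are hypothetical.

```python
import re

# language (2-3 lowercase letters), optional Script (Titlecase, 4 letters),
# optional REGION (2 uppercase letters), in that order.
LANG_TAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<script>[A-Z][a-z]{3}))?"
    r"(?:-(?P<region>[A-Z]{2}))?$"
)


def split_lang_tag(tag: str):
    """Return the subtags as a dict, or None if the tag does not fit the sketch."""
    match = LANG_TAG.match(tag)
    return match.groupdict() if match else None


for tag in ("grc", "eng-GB", "eng-Cyrl-GB"):
    print(tag, split_lang_tag(tag))
```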
ISO codes for human languages appear in @xml:lang
and <for-lang>
. The former states
what language the enclosed text is in. The latter is an empty element that
simply points to a specific language. For example, <for-lang>
in the context of
a TAN-mor file indicates which languages the file was written for.
TAN has several global variables and functions useful for working with language codes. See the section called “language”.
For dates and dates + times, TAN adopts the corresponding XML data types, which follow ISO syntax. That syntax begins with years (the largest unit) and ends with days, seconds, or fractions of seconds (the smallest).
The simplest date takes this form: YYYY-MM-DD. If a time is included, it is specified by continuing the string, first with a T (for time) then the form hh:mm:ss.sss(Z|[-+]hh:mm). For example, 2016-09-20T20:38:27.141-04:00 is an ISO date-time for Tuesday, September 20, 2016 at 8:38 p.m., Eastern Daylight Time (four hours behind UTC).
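The example date-time can be parsed with Python's standard datetime module, which accepts this ISO syntax directly. The sketch only illustrates how the string decomposes; it says nothing about how TAN itself processes dates.

```python
from datetime import datetime, timedelta

stamp = datetime.fromisoformat("2016-09-20T20:38:27.141-04:00")

print(stamp.year, stamp.month, stamp.day)      # largest units first
print(stamp.hour, stamp.minute, stamp.second)  # then the time
print(stamp.microsecond)                       # .141 seconds, in microseconds
print(stamp.utcoffset() == timedelta(hours=-4))
print(stamp.weekday())                         # Monday is 0, so Tuesday is 1
```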
TAN makes extensive use of the following identifiers:
IRI: Internationalized Resource Identifier, a generalization of the URI system, allowing the use of Unicode; defined by RFC 3987
URI: Uniform Resource Identifier, a string of characters used to identify a name or a resource; defined by RFC 3986
URL: Uniform Resource Locator, a URI that identifies a Web resource and the communication protocol for retrieving the resource.
URN: Uniform Resource Name, a term that
originally referred to persistent names that used a bare
urn:
scheme, but is now applied to a variety of systems
that have registered with the IANA. URNs are generally best thought of as
a subset of URIs.
UUID: Universally Unique Identifier, a computer-generated 128-bit number that may be attached as an identifier to any entity. UUIDs can be built into a URN by prefixing them with urn:uuid:.
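For readers who want to see a UUID in URN form, Python's standard uuid module exposes it through the urn attribute. The particular UUID below is just a sample value used for illustration.

```python
import uuid

identifier = uuid.UUID("f81d4fae-7dec-11d0-a765-00a0c91e6bf6")

print(identifier.urn)                 # the UUID rendered as a URN
print(identifier.int < 2 ** 128)      # a UUID is a 128-bit number

# A freshly generated random UUID takes the same form.
print(uuid.uuid4().urn.startswith("urn:uuid:"))
```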
The TAN format makes extensive use of all the above. See also the section called “Tag URNs”.
Identifiers are used in many contexts for many purposes. One such purpose is called Linked Open Data (LOD), also known as the Semantic Web, which aims to allow cross-project interoperability of data. It relies upon a very simple data model called Resource Description Framework (RDF), recommended by the World Wide Web Consortium (W3C). The term "Resource"—the R in RDF—refers to any person, place, concept—anything at all, whether you think of it as a resource or not. "Description" is overly specific, too, since RDF was designed to support general assertions, descriptive or not. Perhaps it is easiest to think of RDF as a standardized way to make assertions, as if the name were simply "Assertion Framework." It is a way to make claims about things in the world.
The RDF data model rests upon the concept of a statement, made of three parts: subject, predicate, and object. Subjects and predicates take identifiers that name things. The object may take an identifier or just data. As people independently identify concepts with the same URLs, they create RDF datasets that can be combined, synthesized, and compared. RDF statements found across the web allow inferences no individual project could ever anticipate.
The Semantic Web recommends the use of URLs as identifiers. That way, if a computer encounters a URL naming a concept, it can be programmed to go to the web resource and retrieve other RDF statements, recursively. So URL identifiers look like a web page address (e.g., http://...), but they are first and foremost names for things. Ideally, those URLs will still name those things after the domain name expires and the web resource cannot be found.
Although RDF statements must be made of only three components, it is possible in a roundabout way to create more complex assertions. In one technique, the assertion itself is given a URL, and then RDF statements are made about the assertion. Such assertions are in some cases not easily integrated with other RDF statements. Users who query an RDF database will not find relevant complex RDF statements unless they build their queries to anticipate such situations (or the query engine has been customized).
Much of TAN can be converted to RDF statements. In fact, TAN may be one of the most human-friendly ways to read and write RDF. For example, consider how one might express "Person X's name is 'Dave Smith'." Compare this snippet (taken from http://linkeddatabook.com/editions/1.0/), written in Turtle, the RDF syntax generally regarded as the most human-readable, ...
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://biglynx.co.uk/people/dave-smith>
    rdf:type foaf:Person ;
    foaf:name "Dave Smith" .
...with the TAN equivalent:
<person>
   <IRI>http://biglynx.co.uk/people/dave-smith</IRI>
   <name>Dave Smith</name>
</person>
These TAN and RDF expressions are interchangeable.
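The correspondence can be made concrete with a short Python sketch that reads the <person> snippet and emits subject-predicate-object triples as plain tuples. This is an illustration only: the function person_to_triples is hypothetical, it is not TAN's own conversion routine, and it treats the snippet as shown, without the TAN namespace that a real file would declare.

```python
import xml.etree.ElementTree as ET

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"

tan_snippet = (
    "<person>"
    "<IRI>http://biglynx.co.uk/people/dave-smith</IRI>"
    "<name>Dave Smith</name>"
    "</person>"
)


def person_to_triples(xml_text: str) -> list[tuple[str, str, str]]:
    """Turn a <person> element into (subject, predicate, object) triples."""
    person = ET.fromstring(xml_text)
    triples = []
    for iri in person.findall("IRI"):
        subject = iri.text.strip()
        triples.append((subject, RDF_TYPE, FOAF + "Person"))
        for name in person.findall("name"):
            triples.append((subject, FOAF + "name", name.text.strip()))
    return triples


for triple in person_to_triples(tan_snippet):
    print(triple)
```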
But in more complex claims, it is, at this time, not clear whether all assertions in TAN can be losslessly converted to the RDF model. Every class-2 file makes a claim about the text, and there must always be attached to the claim someone who is to be blamed or credited for the assertion. TAN also permits such claims to be modified through traditional adverbs. This is best seen in the TAN-A <claim>, which allows a person to nuance a claim to a degree that is difficult or impossible to express in traditional RDF. For example, RDF does not allow one to say "Person X is not the author of text Y," but TAN does.
TAN claims can also be quite complex. Whereas the standard RDF claim consists of three components—subject, predicate, object—most TAN claims have more. Every TAN claim must have at the minimum: a claimant (no RDF counterpart; the person, organization, or algorithm that asserts the claim), a subject (counterpart to RDF subject), and a verb (counterpart to RDF predicate). Verbs can be defined to permit, require, or disallow other claim components, such as adverbs or objects, many of which are permitted by default. Most TAN claims involve more than three components, so converting a TAN claim to RDF requires creating a complex RDF statement. In many cases, this requires the use of RDF* instead of RDF (link below).
Many TAN claims involve textual subjects or objects. References to parts of text can be quite complex, and they must be made with reference to other entities. It is doubtful whether a given specific textual subject or object can be satisfactorily reduced to an unambiguous IRI, because such an IRI would need to include a mechanism to resolve the meaning of the syntax. Such an IRI must not only explain the work's reference system, but also identify the chosen version, scriptum, and perhaps token definition and numeration system. Many texts have more than one "canonical" reference system, so an IRI might point to two different textual passages, thereby breaking a cardinal rule of IRIs: although an entity may be given multiple IRIs, it is never acceptable for an IRI to be ambiguous. There is, at present, no widely accepted solution to this problem, although attempts have been made through CTS URNs and DTS URNs.
For more details see the section called “General annotations and alignments (<TAN-A>)” and <claim>.
TAN files make extensive use of tag URNs (see the section called “Identifiers and their use (IRIs, URIs, URLs, URNs, UUIDs)”). In fact, TAN's namespace is itself a tag URN (the section called “Namespaces”). A tag URN has two parts:
Namespace. "tag:" + an e-mail address or domain name owned by the person or organization that has authorized the creation of the TAN file + "," + an arbitrary day on which that address or domain name was owned + ":". The day is expressed in the form YYYY-MM-DD, YYYY-MM, or YYYY. A missing MM or DD is implicitly assigned the value of 01.
Name of the subject. An arbitrary string (unique to the namespace chosen) chosen by the namespace owner as a label for the subject (e.g., the file, a work, a scriptum). If you are providing a tag URN for a TAN file, that name can be the same as the filename, but it is a good practice not to do so, because filenames may need to be changed. You should pick a name that is at least somewhat intelligible to human readers. It is a good idea to build a name via categories, from most general to most specific. For example, tag:pat@example.com,2014:work:aristotle-pseudo:secreta-secretorum might be used as an IRI to name the work the Secret of Secrets attributed to Aristotle. A TAN file that transcribes a particular version of this text might be named like this: tag:pat@example.com,2014:transcription:scriptum:badawi-1954:work:secrets.
Although you may use any tag URN coined by someone else, when you create a tag URN, you may use only namespaces you own or owned.
Care should be taken in choosing the name, because you are the sole guarantor of its uniqueness. It is permissible for something to have multiple identifiers, but never acceptable for an identifier to name more than one thing. It is a good practice to keep a master checklist of tag URNs you have created. If you find yourself forgetting, or think you run the risk of creating duplicate tag URNs, you should start afresh by creating a new namespace for your tag URNs, if only by changing the date in the tag URN namespace.
Example 3.1. Tag URNs
tag:jan@example.com,1999-01-31:TAN-T001
tag:example.com,2001-04:work:usc22.1
tag:evagriusponticus.net,2014:tan-a-lm:Evagrius_Praktikos_grc_Guillaumonts
tag:bbrb@example.org,1995-04-01:pos-grc
The first example comes from someone who owned the email address jan@example.com on January 31, 1999 (at the stroke of midnight, Coordinated Universal Time). The other examples follow a similar logic. The namespaces of the second and third examples are tied to the owners of specific domain names. The 2014 in the third example is shorthand for the first second of January 1, 2014.
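The two-part structure of a tag URN, including the implicit 01 for a missing month or day, can be sketched in Python as follows. The pattern is a simplification for illustration and does not enforce every rule of the tag scheme; TAG_PATTERN and parse_tag_urn are hypothetical names.

```python
import re

TAG_PATTERN = re.compile(
    r"^tag:(?P<authority>[^,]+),"          # e-mail address or domain name
    r"(?P<date>\d{4}(?:-\d{2}){0,2}):"     # YYYY, YYYY-MM, or YYYY-MM-DD
    r"(?P<name>.*)$"                       # name of the subject
)


def parse_tag_urn(urn: str):
    """Split a tag URN into authority, full date, and name; None if no match."""
    match = TAG_PATTERN.match(urn)
    if match is None:
        return None
    date_parts = match.group("date").split("-")
    date_parts += ["01"] * (3 - len(date_parts))   # missing MM or DD means 01
    return {
        "authority": match.group("authority"),
        "date": "-".join(date_parts),
        "name": match.group("name"),
    }


print(parse_tag_urn("tag:jan@example.com,1999-01-31:TAN-T001"))
print(parse_tag_urn("tag:example.com,2001-04:work:usc22.1"))
```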
TAN files are identified and named via tag URNs, not URLs, for several reasons:
Permanence. Authors of TAN data are creating files that are meant to be relevant for decades and centuries from now, well after most domain names today have changed ownership or fallen into obsolescence, and well after the creators are dead. URLs are not designed for such longevity.
Responsibility. The TAN format requires every piece of data to be attributable to someone (a person, a group of persons, or an algorithm). A tag URN connects the identifier with the responsible person or group. URLs cannot identify the person or organization responsible for the name.
Accessibility. Tag URNs have almost no barriers. They can be created by anyone who has an email address. No one has to register with a central authority. You can begin naming anything you want, any time you want, without anyone's approval, and without paying anything.
Ease. Tag URNs are easy to use. All you need is an email address, which is very easy to get. You can use a domain name too, but many potential TAN authors never have owned a domain name, and never will, barring them from creating or publishing linked open data under the classic model, where you coin URLs in a domain you own. Many of those who do own domain names cannot or do not wish to configure, populate, maintain, and troubleshoot servers with the referral mechanisms recommended by Semantic Web advocates (see the section called “Resource Description Framework (RDF) and Linked Open Data”).
Scholarly citation norms. In the Semantic Web, the conflation of URL qua name with URL qua location is considered by many a virtue because the single string does double duty, both naming the resource and pointing to a location where more can be learned. Although the combination is elegant from the perspective of an engineer, it is confusing to many others. URLs are commonly thought to be merely locations for data, not names for things. It also goes against an important principle in scholarly citation practices, namely, the name of a publication should always be distinguished from where it might be found.
Further reading:
RFC 4151, the official definition of tag URNs
Regular expressions are patterns for searching text. The term regular here does not mean ordinary. Rather, alluding to the Latin root regula (rule), it refers to a rule-based method of finding and replacing text through patterns. Regular expressions come in different flavors, and have several layers of complexity. TAN regular expressions adhere closely to the flavor recommended for XSLT 3.0 (XML Schema Datatypes plus some extensions), as outlined in XPath Functions 3.1.
Caution | |
---|---|
XML Schema Datatypes define regular expressions differently than does Perl, one of the most common flavors of regular expression. For example, the pipe symbol, |, is treated as a word character (\w) in XML regular expressions. Word characters (\w) are all characters except those that Unicode classifies as punctuation, separators, or other (control, format, unassigned); non-word characters (\W) are exactly those excluded characters. The placement of some of these characters may seem to you counterintuitive or wrong. But at this point complaining will not change the conventions. Any apparent mistakes are definitive ones. Just familiarize yourself with the conventions. |
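The word-character rule can be approximated outside an XML processor. The sketch below assumes the XML Schema definition of \w (every character except those in the Unicode punctuation, separator, and other categories) and contrasts it with Python's own Perl-like re module. The helper is_xml_word_char is hypothetical and not a TAN function.

```python
import re
import unicodedata


def is_xml_word_char(ch: str) -> bool:
    """Approximate XML Schema \\w: anything not in categories P, Z, or C."""
    return unicodedata.category(ch)[0] not in "PZC"


for ch in "|_*+a":
    xml_word = is_xml_word_char(ch)
    python_word = re.fullmatch(r"\w", ch) is not None
    print(repr(ch), "XML \\w:", xml_word, "| Python \\w:", python_word)
```

The pipe and the plus sign are symbols, so they count as word characters under the XML definition, whereas the underscore and the asterisk are punctuation and do not.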
A regular expression search pattern is treated just like a normal search
pattern until the computer reaches a special character: . [ ] \ | ^ $ ? * +
{ } ( )
. Here is a brief key to how those special characters behave in
regular expressions when they are first found. (Some of these special characters
change their meaning if they are found inside square brackets; on this point, see
the recommended reading below):
Table 3.2. Special characters in regular expressions
Symbol | Meaning |
---|---|
. | any character |
| | or (union) |
? | zero or one |
* | zero or more |
+ | one or more |
[ ] | a class of characters |
( ) | a group |
^ | beginning of a line or string (doesn't capture any characters) |
$ | end of a line or string (doesn't capture any characters) |
If you need to use any of those special characters as characters in their
own right, then you need to escape them, by prefixing the character with an escape
character, \.
Table 3.3. Escaped special characters in regular expressions
Symbol | Meaning |
---|---|
\\ | backslash (an escaped escape character) |
\^ | a caret sign (must be escaped with the \) |
\$ | dollar sign (escaped) |
\( | opening parenthesis (escaped) |
\[ | opening square bracket (escaped) |
The escape character appearing before some letters accesses certain classes of characters:
Table 3.4. Character class escapes in regular expressions
Symbol | Meaning |
---|---|
\w | any word character |
\W | any nonword character |
\s | any of the four standard spacing characters: space (U+0020), tab (U+0009), newline (U+000A), carriage return (U+000D) |
\S | anything not a spacing character |
\d | any decimal digit (0-9, as well as the decimal digits of other scripts) |
\D | anything not a digit |
\p{IsGujarati} | any character from the Unicode block named Gujarati |
Some examples of regular expressions:
Table 3.5. Examples of Regular Expressions
Expression | Meaning | What the expression matches when applied to "Wi-fi, good. A_hem* isn't!" |
---|---|---|
^.+$ | one whole line of characters | "Wi-fi, good. A_hem* isn't!" |
[ae] | a or e | "e" |
[a-e] | a, b, c, d, or e | "d", "e" |
[^ae]+ | one or more characters that are anything except a or e | "Wi-fi, good. A_h", "m* isn't!" |
.i | any character followed by i. | "Wi", "fi", " i" |
(.i) | when a character followed by an i is found treat it as a capture group (used only in a search pattern) | "Wi", "fi", " i" |
[aeiou]\w* | any lowercase vowel along with every word character that follows | "i", "i", "ood", "em", "isn" |
[t*]. | any t or * and the following character | "* ", "t!" Note that the asterisk, if inside a character class, represents itself. |
\s+ | one or more space characters | " ", " ", " " |
\w+ | one or more word characters | "Wi", "fi", "good", "A", "hem", "isn", "t" Note that in XML regular expressions the underscore is punctuation, not a word character. |
\W+ | match one or more nonword characters | "-", ", ", ". ", "_", "* ", "'", "!" |
[^q]+ | one or more characters that are not a q | "Wi-fi, good. A_hem* isn't!" |
The examples above provide a taste of how regular expressions are constructed and read.
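Several rows of Table 3.5 can be tried out in Python. Python's re module is a Perl-style flavor, not the XML flavor TAN uses, so the sketch is limited to patterns on which the two agree (character classes, the dot, and \s) and leaves out \w and \W.

```python
import re

sample = "Wi-fi, good. A_hem* isn't!"

for pattern in (r"[a-e]", r"[^ae]+", r".i", r"[t*].", r"\s+"):
    print(pattern, re.findall(pattern, sample))
```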
Regular Expressions and Combining Characters | |
---|---|
Regular expressions can be ambiguous in the context of combining characters. Suppose we have a string of three characters, áb (i.e., an acute accent over the a; the codepoints are, in XML entities, &#x61;&#x301;&#x62;). Whether a pattern such as . should match the a alone or the a together with its accent depends on whether combining characters are counted independently. Unicode has differentiated three levels of support for regular expressions (see official report). Only level-one conformance in XPath and therefore TAN is guaranteed. Combining characters fall in level two. In TAN, character counts depend exclusively upon base characters, not combining ones (see the section called “Combining characters”). |
TAN includes several functions that usefully extend XML regular expressions. See the section called “regular expressions”.
Further reading: