regex-ext-tan global variables, keys, and functions summarized

`$base-marker-regex`

Definition: '-'

Does not rely upon global variables, keys, functions, or templates.

`$base64-key`

Definition: ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', '\')

Used by function rgx:dec-to-n(), rgx:n-to-dec().

Does not rely upon global variables, keys, functions, or templates.

`$characters-allowed-in-ucd-names-regex`

Definition: '[-#a-zA-Z0-9]'

Does not rely upon global variables, keys, functions, or templates.

`$characters-to-escape-when-converting-string-to-regex`

Definition: '[\.\[\]\\\|\^\$\?\*\+\{\}]'

Used by function rgx:escape().

Does not rely upon global variables, keys, functions, or templates.

`$close-group-symbols-regex`

Definition: [\]\)\}]

Used by function rgx:parse-regex().

Does not rely upon global variables, keys, functions, or templates.

`$composite-marker-regex`

Definition: '\+'

Does not rely upon global variables, keys, functions, or templates.

`$default-ucd-decomp-db`

Definition: rgx:get-ucd-decomp-db()

Used by function rgx:string-to-components(), rgx:string-to-composites().

Relies upon rgx:get-ucd-decomp-db().

`$default-ucd-decomp-simple-db`

Definition: rgx:get-ucd-decomp-simple-db()

No variables, keys, functions, or named templates depend upon this xsl:variable.

Relies upon rgx:get-ucd-decomp-simple-db().

`$default-ucd-names-db`

Definition: rgx:get-ucd-names-db()

Used by function rgx:get-chars-by-name(), rgx:build-char-replacement-guide().

Relies upon rgx:get-ucd-names-db().

`$escapes-in-regex`

Definition: '\\[\.\[\]\\\|\^\$\?\*\+\{\}nrtdDsSiIcCwW\d]|\\[pPu]\{[^\}]*\}'

Used by function rgx:parse-regex().

Does not rely upon global variables, keys, functions, or templates.

`$hex-key`

Definition: ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F')

Used by function rgx:dec-to-n(), rgx:n-to-dec(), rgx:hex-to-dec().

Does not rely upon global variables, keys, functions, or templates.

`$name-marker-regex`

Definition: '[\.!]'

Does not rely upon global variables, keys, functions, or templates.

`$open-group-symbols-regex`

Definition: [\[\(\{]

Used by function rgx:parse-regex().

Does not rely upon global variables, keys, functions, or templates.

`$u-item-delimiter-regex`

Definition: ' '

Does not rely upon global variables, keys, functions, or templates.

`$unicode-versions-supported`

Definition: 5.1, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0

Used by function rgx:best-unicode-version().

Does not rely upon global variables, keys, functions, or templates.

Functions

`rgx:analyze-string()`

rgx:analyze-string($input as xs:string?, $pattern as xs:string) as element()

Used by function rgx:analyze-string().

Relies upon rgx:analyze-string().

Option 2 (regex-ext-tan-functions)

rgx:analyze-string($input as xs:string?, $pattern as xs:string, $flags as xs:string) as element()

Used by function rgx:analyze-string().

Relies upon rgx:parse-flags(), rgx:regex().

`rgx:best-unicode-version()`

rgx:best-unicode-version($version as xs:double?) as xs:double

Input: a double representing a Unicode version

Output: the best version supported

Used by function rgx:get-ucd-decomp-db(), rgx:get-ucd-decomp-simple-db(), rgx:get-ucd-names-db(), rgx:parse-flags().

Relies upon $unicode-versions-supported.

`rgx:build-char-replacement-guide()`

rgx:build-char-replacement-guide($words-in-name-to-drop as xs:string*, $words-in-replacement-char-name as xs:string*, $words-not-in-replacement-char-name as xs:string*, $search-is-strict as xs:boolean?, $version as xs:double) as element()

Input: three sequences of strings; a boolean (whether matches should be strict); a 
double (Unicode version)

Output: an XML tree rgx:replace/rgx:char/rgx:with specifying that every rgx:char/@val should be replaced by the string-join of its rgx:with/@val values.

This function should be used to optimize replacement through a global variable. See 
documentation at rgx:replace-by-char-name(), which this function supports.

Used by function rgx:replace-by-char-name().

Relies upon $default-ucd-names-db, rgx:get-chars-by-name(), rgx:get-ucd-names-db().

`rgx:codepoints-to-string()`

rgx:codepoints-to-string($arg as xs:integer*) as xs:string?

one-parameter version of the function below; defaults to XML 1.0

Used by function rgx:codepoints-to-string(), rgx:process-regex-escape-u().

Relies upon rgx:codepoints-to-string().

Option 2 (regex-ext-tan-functions)

rgx:codepoints-to-string($arg as xs:integer*, $xml-1-0 as xs:boolean) as xs:string?

Input: any number of integers

Output: the string value representation, but only if the integers represent valid 
characters in XML

Like fn:codepoints-to-string(), but filters out XML illegal characters

Used by function rgx:codepoints-to-string(), rgx:process-regex-escape-u().

Does not rely upon global variables, keys, functions, or templates.
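The filtering described above can be sketched in Python. This is an illustrative analogue, not the library's XSLT: the names `is_xml_char` and `codepoints_to_string` are hypothetical, the ranges come from the `Char` productions of XML 1.0 and XML 1.1, and the sketch assumes that illegal codepoints are silently dropped rather than causing the whole result to be empty.

```python
def is_xml_char(cp, xml_1_0=True):
    """Return True if codepoint cp is a legal XML character."""
    if xml_1_0:
        # XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        return (cp in (0x9, 0xA, 0xD)
                or 0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)
    # XML 1.1 Char: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    return (0x1 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def codepoints_to_string(codepoints, xml_1_0=True):
    """Join the legal codepoints into a string (assumption: illegal ones are dropped)."""
    return "".join(chr(cp) for cp in codepoints if is_xml_char(cp, xml_1_0))
```

Under these assumptions, `codepoints_to_string([72, 0, 105])` keeps only the two legal codepoints, and passing `xml_1_0=False` additionally admits control characters such as U+0001.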

`rgx:dec-to-hex()`

rgx:dec-to-hex($in as xs:integer?) as xs:string?

Input: xs:integer

Output: the hexadecimal equivalent as a string, e.g., 31 -> '1F'

No variables, keys, functions, or named templates depend upon this xsl:function.

Relies upon rgx:dec-to-n().

`rgx:dec-to-n()`

rgx:dec-to-n($in as xs:integer?, $base as xs:integer) as xs:string?

Input: two integers

Output: a string that represents the first integer in base N, where N is the second integer

Used by function rgx:dec-to-hex(), rgx:dec-to-n().

Relies upon $base64-key, $hex-key, rgx:dec-to-n().
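As a rough illustration of the conversion, here is a minimal Python sketch. It is not the library's XSLT: `HEX_KEY` mirrors the `$hex-key` definition, the `key` parameter is a hypothetical stand-in for however the library chooses between `$hex-key` and `$base64-key`, and the recursion only loosely echoes the function's reliance on itself.

```python
HEX_KEY = "0123456789ABCDEF"  # same digit order as $hex-key


def dec_to_n(value, base, key=HEX_KEY):
    """Render a non-negative integer in the given base, using key as the digit table."""
    if value is None:
        return None
    if value < 0 or not 2 <= base <= len(key):
        raise ValueError("value must be >= 0 and base must fit the digit key")
    quotient, remainder = divmod(value, base)
    # Convert the higher-order digits first, then append the lowest digit.
    head = dec_to_n(quotient, base, key) if quotient else ""
    return head + key[remainder]


def dec_to_hex(value):
    """Hexadecimal convenience wrapper, analogous to rgx:dec-to-hex()."""
    return dec_to_n(value, 16)
```

With this sketch, `dec_to_hex(31)` gives `'1F'`, matching the example in the `rgx:dec-to-hex()` entry.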

`rgx:escape()`

rgx:escape($strings as xs:string*) as xs:string*

Input: any sequence of strings

Output: each string prepared for regular expression searches, i.e., with reserved 
characters escaped out.

No variables, keys, functions, or named templates depend upon this xsl:function.

Relies upon $characters-to-escape-when-converting-string-to-regex.
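The escaping step can be approximated in Python as follows. This is a sketch under assumptions, not the library's code: the character class is copied from `$characters-to-escape-when-converting-string-to-regex` as shown above (which, as listed, does not include parentheses or the hyphen), and the function name `escape` is only an illustrative analogue.

```python
import re

# Same characters as the documented class: . [ ] \ | ^ $ ? * + { }
CHARS_TO_ESCAPE = re.compile(r"[.\[\]\\|^$?*+{}]")


def escape(strings):
    """Prefix every reserved character in each string with a backslash."""
    return [CHARS_TO_ESCAPE.sub(lambda m: "\\" + m.group(0), s) for s in strings]
```

For example, `escape(["a.b"])` returns a one-item list whose string is `a\.b`.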

`rgx:get-chars-by-name()`

rgx:get-chars-by-name($words-in-name as xs:string*, $words-not-in-name as xs:string*, $version as xs:double) as element()*

Input: two sets of strings

Output: <char> elements from the Unicode database, the words of whose name (or 
alias) match all the first set and none of the second

Used by function rgx:process-regex-escape-u(), rgx:build-char-replacement-guide().

Relies upon $default-ucd-names-db, rgx:get-ucd-names-db().

`rgx:get-ucd-decomp-db()`

rgx:get-ucd-decomp-db()

zero-parameter version of fuller one below

Used by variable $default-ucd-decomp-db.

Used by function rgx:get-ucd-decomp-db(), rgx:string-to-components(), rgx:string-to-composites().

Relies upon rgx:get-ucd-decomp-db().

Option 2 (regex-ext-tan-functions)

rgx:get-ucd-decomp-db($version as xs:double)

Input: a double specifying a Unicode version number

Output: the document that contains the data for decomposing characters to and from 
their parts

Used by variable $default-ucd-decomp-db.

Used by function rgx:get-ucd-decomp-db(), rgx:string-to-components(), rgx:string-to-composites().

Relies upon rgx:best-unicode-version().

`rgx:get-ucd-decomp-simple-db()`

rgx:get-ucd-decomp-simple-db()

zero-parameter version of fuller one below

Used by variable $default-ucd-decomp-simple-db.

Used by function rgx:string-base(), rgx:get-ucd-decomp-simple-db().

Relies upon rgx:get-ucd-decomp-simple-db().

Option 2 (regex-ext-tan-functions)

rgx:get-ucd-decomp-simple-db($version as xs:double)

Input: a double specifying a Unicode version number

Output: the document that contains the data for translating characters to and from 
their base characters

Used by variable $default-ucd-decomp-simple-db.

Used by function rgx:string-base(), rgx:get-ucd-decomp-simple-db().

Relies upon rgx:best-unicode-version().

`rgx:get-ucd-names-db()`

rgx:get-ucd-names-db()

zero-parameter version of fuller one below

Used by variable $default-ucd-names-db.

Used by function rgx:get-ucd-names-db(), rgx:get-chars-by-name(), rgx:build-char-replacement-guide().

Relies upon rgx:get-ucd-names-db().

Option 2 (regex-ext-tan-functions)

rgx:get-ucd-names-db($version as xs:double)

Input: a double specifying a Unicode version number

Output: the document that contains the data for Unicode character names

Used by variable $default-ucd-names-db.

Used by function rgx:get-ucd-names-db(), rgx:get-chars-by-name(), rgx:build-char-replacement-guide().

Relies upon rgx:best-unicode-version().

`rgx:hex-to-dec()`

rgx:hex-to-dec($hex as xs:string?) as xs:integer?

Input: a string representing a hexadecimal number

Output: the integer value, e.g., '1F' -> 31

Relies upon $hex-key.

`rgx:matches()`

rgx:matches($input as xs:string?, $pattern as xs:string) as xs:boolean

two-parameter version of the three-parameter function below

Used by function rgx:matches(), rgx:regex-is-valid().

Relies upon rgx:matches().

Option 2 (regex-ext-tan-functions)

rgx:matches($input as xs:string?, $pattern as xs:string, $flags as xs:string) as xs:boolean

Parallel to fn:matches(), but converts \u{} into classes. See rgx:regex() for 
details.

Used by function rgx:matches(), rgx:regex-is-valid().

Relies upon rgx:parse-flags(), rgx:regex().

`rgx:n-to-dec()`

rgx:n-to-dec($input as xs:string?, $base-n as xs:integer) as xs:integer?

Input: string representation of some number; an integer

Output: the decimal integer value of the first parameter, read as a numeral in the base given by the second parameter

Used by function rgx:n-to-dec().

Relies upon $base64-key, $hex-key, rgx:n-to-dec().
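The reverse conversion can be sketched in Python in the same spirit. Again this is an illustrative analogue rather than the XSLT source: `HEX_KEY` mirrors `$hex-key`, the `key` parameter is hypothetical, and upper-casing the input in `hex_to_dec` is an assumption made so that lowercase hex digits (as in the `\u{}` examples) are accepted.

```python
HEX_KEY = "0123456789ABCDEF"  # same digit order as $hex-key


def n_to_dec(text, base, key=HEX_KEY):
    """Interpret text as a base-N numeral, using key as the digit table."""
    if text is None:
        return None
    total = 0
    for ch in text:
        digit = key.index(ch)  # raises ValueError for characters outside the key
        if digit >= base:
            raise ValueError(f"digit {ch!r} is not valid in base {base}")
        # Shift the running total one place left and add the new digit.
        total = total * base + digit
    return total


def hex_to_dec(text):
    """Hexadecimal convenience wrapper, analogous to rgx:hex-to-dec()."""
    return None if text is None else n_to_dec(text.upper(), 16)
```

Here `hex_to_dec('1F')` gives `31`, matching the example in the `rgx:hex-to-dec()` entry.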

`rgx:parse-flags()`

rgx:parse-flags($flags as xs:string) as element()

Input: a string corresponding to a $flags parameter in a regular expression 
function

Output: an element that separates the parts of the string that are special TAN-regex flags from the parts that are not

Used by function rgx:replace(), rgx:matches(), rgx:tokenize(), rgx:analyze-string().

Relies upon rgx:best-unicode-version().

`rgx:parse-regex()`

rgx:parse-regex($regex as xs:string?, $version as xs:double) as element()

Input: a regular expression

Output: an element with the regular expression parsed

Any errors are embedded as <error>s

Used by function rgx:regex().

Relies upon $close-group-symbols-regex, $escapes-in-regex, $open-group-symbols-regex, rgx:process-regex-escape-u().

`rgx:process-regex-escape-u()`

rgx:process-regex-escape-u($val-inside-braces as xs:string) as xs:string?

one-parameter version of fuller one, below

Used by function rgx:process-regex-escape-u(), rgx:parse-regex().

Relies upon rgx:process-regex-escape-u().

Option 2 (regex-ext-tan-functions)

rgx:process-regex-escape-u($val-inside-braces as xs:string, $version as xs:double) as xs:string?

Input: a string that is inside the braces of a \u{} expression

Output: the expansion of the expression

Acceptable contents of \u{}:

1. Individual hex values or ranges of them, separated by a comma or space. Values will 
be replaced with entities

'4d-4f, 51' > '&#x4d;-&#x4f;&#x51;'

2. Composite signal: + followed by a string

'+b' > 'bᵇḃḅḇ⒝ⓑ㍴㏔㏝ｂ𝐛𝑏𝒃𝒷𝓫𝔟𝕓𝖇𝖻𝗯𝘣𝙗𝚋'

3. Base signal: - followed by a string

'-ḉ' > 'c'

4. Name keywords: chains of . or ! each followed by a string

'.greek.capital.perispomeni' > 'ἎἏἮἯἾἿὟὮὯᾎᾏᾞᾟᾮᾯ'

'.latin.cedilla' > 'ÇçĢģĶķĻļŅņŖŗŞşŢţȨȩᷗḈḉḐḑḜḝḨḩ'

'.m!small' > 'MƜൔᒻᒼᒾᒿᛗᛘᛙᣘᧄᮿᰮᴹḾṀṂℳⓂⱮㄇ㎛㎡㎥㎧㎨㏁㏞㏟ꚳꟽꟿꩌＭ'

Used by function rgx:process-regex-escape-u(), rgx:parse-regex().

Relies upon $base-marker-regex, $characters-allowed-in-ucd-names-regex, $composite-marker-regex, $name-marker-regex, $u-item-delimiter-regex, rgx:codepoints-to-string(), rgx:get-chars-by-name(), rgx:hex-to-dec(), rgx:string-base(), rgx:string-to-composites().
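To make case 1 above concrete, here is a small Python sketch of how hex values and ranges might be turned into entities. It covers only that first case and is not the library's implementation: the function name `expand_hex_items` is hypothetical, and rejecting anything that is not a hex value or a two-ended hex range is an assumption made to keep the sketch separate from the composite, base, and name-keyword cases.

```python
import re

HEX_ITEM = re.compile(r"[0-9A-Fa-f]+")


def expand_hex_items(inside_braces):
    """Turn '4d-4f, 51' into '&#x4d;-&#x4f;&#x51;' (hex values and ranges only)."""
    pieces = []
    # Items are separated by commas and/or whitespace.
    for item in re.split(r"[,\s]+", inside_braces.strip()):
        if not item:
            continue
        parts = item.split("-")
        if len(parts) > 2 or not all(HEX_ITEM.fullmatch(p) for p in parts):
            raise ValueError(f"not a hex value or range: {item!r}")
        # A range keeps its hyphen so it still works inside a character class.
        pieces.append("-".join(f"&#x{p};" for p in parts))
    return "".join(pieces)
```

Applied to the documented example, `expand_hex_items('4d-4f, 51')` yields `'&#x4d;-&#x4f;&#x51;'`.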

`rgx:regex()`

rgx:regex($regex as xs:string?) as xs:string?

one-parameter version of the longer one, below

Used by function rgx:regex(), rgx:replace(), rgx:matches(), rgx:tokenize(), rgx:analyze-string().

Relies upon rgx:regex().

Option 2 (regex-ext-tan-functions)

rgx:regex($regex as xs:string?, $version as xs:double) as xs:string?

Input: string representing a regex pattern

Output: the regular expression adjusted according to TAN-regex rules

Used by function rgx:regex(), rgx:replace(), rgx:matches(), rgx:tokenize(), rgx:analyze-string().

Relies upon rgx:parse-regex().

`rgx:regex-is-valid()`

rgx:regex-is-valid($input-regex as xs:string?) as xs:boolean

Input: a string

Output: true if the string is a valid regular expression, false otherwise

No variables, keys, functions, or named templates depend upon this xsl:function.

Relies upon rgx:matches().

`rgx:replace()`

rgx:replace($input as xs:string?, $pattern as xs:string, $replacement as xs:string) as xs:string

three-parameter version of the four-parameter function below

Used by function rgx:replace().

Relies upon rgx:replace().

Option 2 (regex-ext-tan-functions)

rgx:replace($input as xs:string?, $pattern as xs:string, $replacement as xs:string, $flags as xs:string) as xs:string

Parallel to fn:replace(), but converts \u{} into classes. See rgx:regex() for 
details.

Used by function rgx:replace().

Relies upon rgx:parse-flags(), rgx:regex().

`rgx:replace-by-char-name()`

rgx:replace-by-char-name($string-to-replace as xs:string?, $words-in-name-to-drop as xs:string*, $words-in-replacement-char-name as xs:string*, $words-not-in-replacement-char-name as xs:string*, $search-is-strict as xs:boolean?) as xs:string?

five-parameter version of the full function, below

Used by function rgx:replace-by-char-name().

Relies upon rgx:replace-by-char-name().

Option 2 (regex-ext-tan-functions)

five-parameter version of the full function, below

Used by function rgx:replace-by-char-name().

Relies upon rgx:build-char-replacement-guide(), rgx:replace-by-char-name().

Option 3 (regex-ext-tan-functions)

rgx:replace-by-char-name($string-to-replace as xs:string?, $replace-guide as element(rgx:replace)) as xs:string?

Input: a string to be changed; three sets of strings; a boolean

Output: the string with characters replaced according to the rules below

This function was written primarily to transform Greek letters, e.g., to change 
graves into acutes

The input string is broken into individual characters. Focus is placed on only those characters whose Unicode name has words matching $words-in-name-to-drop. Other words in the first matching name are retained, and a search is made for any other Unicode character whose name has the words specified by $words-in-replacement-char-name and none of the words specified by $words-not-in-replacement-char-name.

If the boolean is false, then the search will return Unicode codepoints that might 
have other words in their name; otherwise the match must correspond to all words in the 
target name.

If the character does not have an entry in the $replace-guide, the original 
character is returned.

The process will be applied to a char against only the first name found, not aliases.

To use this function optimally, first bind the second parameter to a global 
variable, using rgx:build-char-replacement-guide(), then use the 2-parameter version of 
this function.

Used by function rgx:replace-by-char-name().

Does not rely upon global variables, keys, functions, or templates.

`rgx:string-base()`

rgx:string-base($arg as xs:string?) as xs:string?

one-param version of the fuller one, below

Used by function rgx:string-base(), rgx:process-regex-escape-u().

Relies upon rgx:string-base().

Option 2 (regex-ext-tan-functions)

rgx:string-base($arg as xs:string?, $version as xs:double) as xs:string?

This function takes any string and replaces every character with its base Unicode character. This function is useful to prepare a text to be searched without respect to accents. E.g., ἄνθρωπός -> ανθρωπος

Note, the ς is retained because it doesn't decompose. To match on σ one needs to use the flag 'i' (case insensitive) because ς case-folds to σ.

This function is similar to rgx:string-to-components(), but strictly enforces a one-for-one replacement, so that it behaves much like fn:lower-case() and fn:upper-case(), where the string length is always preserved. To this end, this function is based on fn:translate(), and uses simple decomposition databases, which are much smaller and quicker to use than are full decomposition databases.

The strict one-for-one replacement observes the following rules:

1. If a character decomposes to a single character, that single character is returned.

2. If a character decomposes to multiple characters that are identical, that single character is returned, e.g., ‴ to ′

3. If a character decomposes to multiple characters, a distinction is made between base and non-base characters:

- Base characters: \p{Lu}\p{Ll}\p{Lt}\p{Lo}\p{N}\p{S}

- Non-base characters: \p{Lm}\p{M}\p{P}\p{Z}\p{C}

If after non-base characters are removed there is not exactly one unique decomposed character left, the original input is retained.

The above rules are already reflected in the contents of the simple decomposition database, so do not need to be expressed in this function. For more, see ucd/ucd-decomp.xsl.

Used by function rgx:string-base(), rgx:process-regex-escape-u().

Relies upon rgx:get-ucd-decomp-simple-db().
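The one-for-one rules described in this entry can be tried out with a short Python sketch. It is an approximation under stated assumptions, not the library's method: the real function uses fn:translate() against a prebuilt simple decomposition database, whereas this sketch recomputes the rules per character using Python's `unicodedata` module and NFKD normalization as a stand-in for that database. The names `char_base` and `string_base` are illustrative.

```python
import unicodedata


def _is_base(ch):
    """Base: Lu, Ll, Lt, Lo, N*, S*. Non-base: Lm, M*, P*, Z*, C*."""
    category = unicodedata.category(ch)
    return category != "Lm" and category[0] not in "MPZC"


def char_base(ch):
    """Return the single base character for ch, or ch itself if none is unambiguous."""
    parts = unicodedata.normalize("NFKD", ch)  # assumption: NFKD stands in for the database
    if len(set(parts)) == 1:
        # Covers undecomposable characters and runs of one repeated character.
        return parts[0]
    bases = {p for p in parts if _is_base(p)}
    return next(iter(bases)) if len(bases) == 1 else ch


def string_base(text):
    """Replace each character by its base character; the length is preserved."""
    if text is None:
        return None
    return "".join(char_base(ch) for ch in text)
```

Because each input character maps to exactly one output character, the result always has the same number of characters as the input; a ligature such as ﬁ, which decomposes to two different base letters, is left unchanged.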

`rgx:string-to-components()`

rgx:string-to-components($arg as xs:string?) as xs:string*

one-param version of the fuller one, below

Used by function rgx:string-to-components().

Relies upon rgx:string-to-components().

Option 2 (regex-ext-tan-functions)

rgx:string-to-components($arg as xs:string?, $version as xs:double) as xs:string*

Input: any string; a Unicode version number.

Output: one string per character in the input; if a character lends itself to 
decomposition, its component parts are returned, otherwise the character itself is returned.

This function is the inverse of rgx:string-to-composites().

If you wish to have more control over which components are returned (e.g., exclusion of combining marks), consider using either rgx:string-base() or the database directly: rgx:get-ucd-decomp-db(). Each rgx:char/rgx:b has @gc with the code for the component's general category.

Used by function rgx:string-to-components().

Relies upon $default-ucd-decomp-db, rgx:get-ucd-decomp-db().
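A comparable effect can be sketched in Python, with the caveat that this is only an analogue: the library reads its own decomposition database for a chosen Unicode version, while the sketch below assumes that NFKD normalization from Python's `unicodedata` module (tied to whatever Unicode version the interpreter ships) is a close enough substitute. The function name is illustrative.

```python
import unicodedata


def string_to_components(text):
    """Return one string per input character, holding its decomposed parts."""
    if text is None:
        return []
    # Characters with no decomposition come back unchanged.
    return [unicodedata.normalize("NFKD", ch) for ch in text]
```

Unlike the base-character approach, the returned strings may be longer than one character, e.g., a letter followed by its combining marks.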

`rgx:string-to-composites()`

rgx:string-to-composites($arg as xs:string?) as xs:string*

one-parameter version of fuller one, below

Used by function rgx:string-to-composites(), rgx:process-regex-escape-u().

Relies upon rgx:string-to-composites().

Option 2 (regex-ext-tan-functions)

rgx:string-to-composites($arg as xs:string?, $version as xs:double) as xs:string*

Input: a string; a version of Unicode (double)

Output: one string per character in the input; that string consists of the character 
itself followed by all characters that use it as a base

This function is the inverse of rgx:string-to-components(). E.g., 'Max' -> 'MᴹḾṀṂℳⅯⓂ㎆㎒㎫㎹㎿㏁Ｍ𝐌𝑀𝑴𝓜𝔐𝕄𝕸𝖬𝗠𝘔𝙈𝙼🄼🅋🅪🅫aªàáâãäåāăąǎǟǡǻȁȃȧᵃḁẚạảấầẩẫậắằẳẵặₐ℀℁ⓐ㏂ａ𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊xˣẋẍₓⅹⅺⅻⓧｘ𝐱𝑥𝒙𝓍𝔁𝔵𝕩𝖝𝗑𝘅𝘹𝙭𝚡'

This is useful for preparing regex character classes to broaden a search.

Used by function rgx:string-to-composites(), rgx:process-regex-escape-u().

Relies upon $default-ucd-decomp-db, rgx:get-ucd-decomp-db().

`rgx:tokenize()`