regex-ext-tan-functions
$base-marker-regex
Definition: '-'
Used by function rgx:process-regex-escape-u().
Does not rely upon global variables, keys, functions, or templates.
regex-ext-tan-functions
$base64-key
Definition: ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', '\')
Used by function rgx:dec-to-n(), rgx:n-to-dec().
Does not rely upon global variables, keys, functions, or templates.
regex-ext-tan-functions
$characters-allowed-in-ucd-names-regex
Definition: '[-#\(\)a-zA-Z0-9]'
Used by function rgx:process-regex-escape-u().
Does not rely upon global variables, keys, functions, or templates.
regex-ext-tan-functions
$characters-to-escape-when-converting-string-to-regex
Definition: '[\.\[\]\\\|\^\$\?\*\+\{\}\(\)]'
Used by function rgx:escape().
Does not rely upon global variables, keys, functions, or templates.
regex-ext-tan-functions
$close-group-symbols-regex
Definition: [\]\)\}]
Used by function rgx:parse-regex().
Does not rely upon global variables, keys, functions, or templates.
regex-ext-tan-functions
$composite-marker-regex
Definition: '\+'
Used by function rgx:process-regex-escape-u().
Does not rely upon global variables, keys, functions, or templates.
regex-ext-tan-functions
$default-ucd-decomp-db
Definition: rgx:get-ucd-decomp-db()
Used by function rgx:string-to-components(), rgx:string-to-composites().
Relies upon rgx:get-ucd-decomp-db().
regex-ext-tan-functions
$default-ucd-decomp-simple-db
Definition: rgx:get-ucd-decomp-simple-db()
No variables, keys, functions, or named templates depend upon this xsl:variable.
Relies upon rgx:get-ucd-decomp-simple-db().
regex-ext-tan-functions
$default-ucd-names-db
Definition: rgx:get-ucd-names-db()
Used by function rgx:get-chars-by-name(), rgx:build-char-replacement-guide().
Relies upon rgx:get-ucd-names-db().
regex-ext-tan-functions
$escapes-in-regex
Definition: '\\[\.\[\]\\\|\^\$\?\*\+\{\}\(\)nrtdDsSiIcCwW\d]|\\[pPu]\{[^\}]*\}'
Used by function rgx:parse-regex().
Does not rely upon global variables, keys, functions, or templates.
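As a rough illustration of what the pattern above matches, the following Python sketch reuses it verbatim to list the escape sequences found in a regular expression. It assumes that Python's re dialect treats this particular pattern the same way the XPath dialect does; find_escapes is a hypothetical helper and not part of the library.

```python
import re

# The pattern from the definition above, copied verbatim.
ESCAPES_IN_REGEX = re.compile(
    r'\\[\.\[\]\\\|\^\$\?\*\+\{\}\(\)nrtdDsSiIcCwW\d]|\\[pPu]\{[^\}]*\}'
)


def find_escapes(regex):
    """Return every escape sequence in `regex`, in order of appearance."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return ESCAPES_IN_REGEX.findall(regex)


if __name__ == "__main__":
    print(find_escapes(r'a\.b\u{4d-4f}\p{L}\d+'))
```

The second alternative is what picks up the brace-delimited forms \p{}, \P{} and \u{}, since p, P and u are not in the single-character class.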
regex-ext-tan-functions
$hex-key
Definition: ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F')
Used by function rgx:dec-to-n(), rgx:n-to-dec(), rgx:hex-to-dec().
Does not rely upon global variables, keys, functions, or templates.
regex-ext-tan-functions
$name-marker-regex
Definition: '[\.!]'
Used by function rgx:process-regex-escape-u().
Does not rely upon global variables, keys, functions, or templates.
regex-ext-tan-functions
$open-group-symbols-regex
Definition: [\[\(\{]
Used by function rgx:parse-regex().
Does not rely upon global variables, keys, functions, or templates.
regex-ext-tan-functions
$u-item-delimiter-regex
Definition: ' '
Used by function rgx:process-regex-escape-u().
Does not rely upon global variables, keys, functions, or templates.
regex-ext-tan-functions
$unicode-versions-supported
Definition: 5.1, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0
Used by function rgx:best-unicode-version().
Does not rely upon global variables, keys, functions, or templates.
Option 1 (regex-ext-tan-functions)
rgx:analyze-string($input as xs:string?, $pattern as xs:string) as element()
Used by function rgx:analyze-string().
Relies upon rgx:analyze-string().
Option 2 (regex-ext-tan-functions)
rgx:analyze-string($input as xs:string?, $pattern as xs:string, $flags as xs:string) as element()
Used by function rgx:analyze-string().
Relies upon rgx:parse-flags(), rgx:regex().
regex-ext-tan-functions
rgx:best-unicode-version($version as xs:double?) as xs:double
Input: a double representing a Unicode version
Output: the best version supported
Used by function rgx:get-ucd-decomp-db(), rgx:get-ucd-decomp-simple-db(), rgx:get-ucd-names-db(), rgx:parse-flags().
Relies upon $unicode-versions-supported.
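The entry above does not say how the "best" version is chosen. The Python sketch below shows one plausible reading, offered purely as an assumption: pick the highest supported version that does not exceed the request, fall back to the newest version when no version is given, and to the oldest when the request predates all supported versions. The function name and the fallback rules are hypothetical.

```python
# Versions listed in $unicode-versions-supported.
SUPPORTED_VERSIONS = [5.1, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]


def best_unicode_version(version=None):
    """Pick a supported version for `version` (assumed selection rule)."""
    if version is None:
        # Assumption: an empty request means "use the newest version".
        return max(SUPPORTED_VERSIONS)
    candidates = [v for v in SUPPORTED_VERSIONS if v <= version]
    # Assumption: requests older than every supported version get the oldest.
    return max(candidates) if candidates else min(SUPPORTED_VERSIONS)


if __name__ == "__main__":
    print(best_unicode_version(12.1))
```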
regex-ext-tan-functions
rgx:build-char-replacement-guide($words-in-name-to-drop as xs:string*, $words-in-replacement-char-name as xs:string*, $words-not-in-replacement-char-name as xs:string*, $search-is-strict as xs:boolean?, $version as xs:double) as element()
Input: three sequences of strings; a boolean (whether matches should be strict); a double (Unicode version)
Output: an XML tree rgx:replace/rgx:char/rgx:with specifying that every rgx:char/@val should be replaced by a string-joining of its rgx:with/@val.
This function should be used to optimize replacement through a global variable. See documentation at rgx:replace-by-char-name(), which this function supports.
Used by function rgx:replace-by-char-name().
Relies upon $default-ucd-names-db, rgx:get-chars-by-name(), rgx:get-ucd-names-db().
Option 1 (regex-ext-tan-functions)
rgx:codepoints-to-string($arg as xs:integer*) as xs:string?
one-parameter version of the function below; defaults to XML 1.0
Used by function rgx:codepoints-to-string(), rgx:process-regex-escape-u().
Relies upon rgx:codepoints-to-string().
Option 2 (regex-ext-tan-functions)
rgx:codepoints-to-string($arg as xs:integer*, $xml-1-0 as xs:boolean) as xs:string?
Input: any number of integers; a boolean (whether XML 1.0 applies)
Output: the string value representation, but only if the integers represent valid characters in XML
Like fn:codepoints-to-string(), but filters out XML illegal characters
Used by function rgx:codepoints-to-string(), rgx:process-regex-escape-u().
Does not rely upon global variables, keys, functions, or templates.
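The filtering idea can be sketched in Python using the Char productions of the XML 1.0 and XML 1.1 specifications. This is an analogue, not the library's code: it assumes that illegal codepoints are silently dropped and that a false $xml-1-0 means XML 1.1 rules apply; the function names are hypothetical.

```python
def is_xml_char(codepoint, xml_1_0=True):
    """True if `codepoint` is a legal character in XML 1.0 (or 1.1)."""
    common = (
        0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )
    if xml_1_0:
        # XML 1.0: tab, line feed, carriage return, then U+0020 upward.
        return (
            codepoint in (0x9, 0xA, 0xD)
            or 0x20 <= codepoint <= 0xD7FF
            or common
        )
    # XML 1.1: everything from U+0001 upward, minus surrogates and U+FFFE/F.
    return 0x1 <= codepoint <= 0xD7FF or common


def codepoints_to_string(codepoints, xml_1_0=True):
    """Join the legal codepoints into a string, skipping illegal ones."""
    return ''.join(chr(cp) for cp in codepoints if is_xml_char(cp, xml_1_0))


if __name__ == "__main__":
    print(repr(codepoints_to_string([72, 0, 105, 8])))
```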
regex-ext-tan-functions
rgx:dec-to-hex($in as xs:integer?) as xs:string?
Input: xs:integer
Output: the hexadecimal equivalent as a string, e.g., 31 -> '1F'
No variables, keys, functions, or named templates depend upon this xsl:function.
Relies upon rgx:dec-to-n().
regex-ext-tan-functions
rgx:dec-to-n($in as xs:integer?, $base as xs:integer) as xs:string?
Input: two integers
Output: a string that represents the first numeral in base N, where N is the second numeral
Used by function rgx:dec-to-hex(), rgx:dec-to-n().
Relies upon $base64-key, $hex-key, rgx:dec-to-n().
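The conversion described above can be sketched recursively in Python, mirroring the fact that rgx:dec-to-n() relies upon itself. The sketch assumes non-negative input and covers only bases 2 through 16 using the digits of $hex-key; how the library switches to $base64-key is not stated here, so that case is left out. The function names are hypothetical.

```python
# Digits taken from $hex-key.
HEX_KEY = '0123456789ABCDEF'


def dec_to_n(value, base):
    """Render non-negative `value` as a numeral in `base` (2 to 16)."""
    if not 2 <= base <= len(HEX_KEY):
        raise ValueError('base must be between 2 and 16')
    if value < 0:
        raise ValueError('value must be non-negative')
    if value < base:
        return HEX_KEY[value]
    # Convert the leading digits first, then append the last digit.
    return dec_to_n(value // base, base) + HEX_KEY[value % base]


def dec_to_hex(value):
    """Hexadecimal shortcut, analogous to rgx:dec-to-hex()."""
    return dec_to_n(value, 16)


if __name__ == "__main__":
    print(dec_to_hex(31))
```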
regex-ext-tan-functions
rgx:escape($strings as xs:string*) as xs:string*
Input: any sequence of strings
Output: each string prepared for regular expression searches, i.e., with reserved characters escaped.
No variables, keys, functions, or named templates depend upon this xsl:function.
Relies upon $characters-to-escape-when-converting-string-to-regex.
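A minimal Python analogue of this escaping, assuming that each reserved character is simply prefixed with a backslash. The character class is copied from $characters-to-escape-when-converting-string-to-regex; the function name is hypothetical.

```python
import re

# Character class copied from the variable definition.
CHARS_TO_ESCAPE = re.compile(r'[\.\[\]\\\|\^\$\?\*\+\{\}\(\)]')


def escape(strings):
    """Backslash-escape every reserved regex character in each string."""
    # In the replacement template, \\ is a literal backslash and \g<0>
    # is the matched character itself.
    return [CHARS_TO_ESCAPE.sub(r'\\\g<0>', s) for s in strings]


if __name__ == "__main__":
    print(escape(['a.b', '1+1=2?']))
```

The escaped string can then be used as a pattern that matches only the original text literally.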
regex-ext-tan-functions
rgx:get-chars-by-name($words-in-name as xs:string*, $words-not-in-name as xs:string*, $version as xs:double) as element()*
Input: two sets of strings; a double (Unicode version)
Output: <char> elements from the Unicode database, the words of whose name (or alias) match all the first set and none of the second
Used by function rgx:process-regex-escape-u(), rgx:build-char-replacement-guide().
Relies upon $default-ucd-names-db, rgx:get-ucd-names-db().
Option 1 (regex-ext-tan-functions)
rgx:get-ucd-decomp-db()
zero-parameter version of fuller one below
Used by variable $default-ucd-decomp-db.
Used by function rgx:get-ucd-decomp-db(), rgx:string-to-components(), rgx:string-to-composites().
Relies upon rgx:get-ucd-decomp-db().
Option 2 (regex-ext-tan-functions)
rgx:get-ucd-decomp-db($version as xs:double)
Input: a double specifying a Unicode version number
Output: the document that contains the data for decomposing characters to and from their parts
Used by variable $default-ucd-decomp-db.
Used by function rgx:get-ucd-decomp-db(), rgx:string-to-components(), rgx:string-to-composites().
Relies upon rgx:best-unicode-version().
Option 1 (regex-ext-tan-functions)
rgx:get-ucd-decomp-simple-db()
zero-parameter version of fuller one below
Used by variable $default-ucd-decomp-simple-db.
Used by function rgx:string-base(), rgx:get-ucd-decomp-simple-db().
Relies upon rgx:get-ucd-decomp-simple-db().
Option 2 (regex-ext-tan-functions)
rgx:get-ucd-decomp-simple-db($version as xs:double)
Input: a double specifying a Unicode version number
Output: the document that contains the data for translating characters to and from their base characters
Used by variable $default-ucd-decomp-simple-db.
Used by function rgx:string-base(), rgx:get-ucd-decomp-simple-db().
Relies upon rgx:best-unicode-version().
Option 1 (regex-ext-tan-functions)
rgx:get-ucd-names-db()
zero-parameter version of fuller one below
Used by variable $default-ucd-names-db.
Used by function rgx:get-ucd-names-db(), rgx:get-chars-by-name(), rgx:build-char-replacement-guide().
Relies upon rgx:get-ucd-names-db().
Option 2 (regex-ext-tan-functions)
rgx:get-ucd-names-db($version as xs:double)
Input: a double specifying a Unicode version number
Output: the document that contains the data for Unicode character names
Used by variable $default-ucd-names-db.
Used by function rgx:get-ucd-names-db(), rgx:get-chars-by-name(), rgx:build-char-replacement-guide().
Relies upon rgx:best-unicode-version().
regex-ext-tan-functions
rgx:hex-to-dec($hex as xs:string?) as xs:integer?
Input: a string representing a hexadecimal number
Output: the integer value, e.g., '1F' -> 31
Used by function rgx:process-regex-escape-u().
Relies upon $hex-key.
Option 1 (regex-ext-tan-functions)
rgx:matches($input as xs:string?, $pattern as xs:string) as xs:boolean
two-parameter version of the three-parameter function below
Used by function rgx:matches(), rgx:regex-is-valid().
Relies upon rgx:matches().
Option 2 (regex-ext-tan-functions)
rgx:matches($input as xs:string?, $pattern as xs:string, $flags as xs:string) as xs:boolean
Parallel to fn:matches(), but converts \u{} into classes. See rgx:regex() for details.
Used by function rgx:matches(), rgx:regex-is-valid().
Relies upon rgx:parse-flags(), rgx:regex().
regex-ext-tan-functions
rgx:n-to-dec($input as xs:string?, $base-n as xs:integer) as xs:integer?
Input: string representation of some number; an integer (the base)
Output: the integer value of the first parameter, read as a numeral in the base given by the second parameter
Used by function rgx:n-to-dec().
Relies upon $base64-key, $hex-key, rgx:n-to-dec().
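The reverse conversion can be sketched in Python as a left-to-right accumulation. As assumptions, the sketch handles only bases 2 through 16 with the digits of $hex-key and upper-cases its input so that lowercase hex digits are accepted; the library's handling of case and of $base64-key is not documented here. The function names are hypothetical.

```python
# Digits taken from $hex-key.
HEX_KEY = '0123456789ABCDEF'


def n_to_dec(text, base):
    """Read `text` as a numeral in `base` (2 to 16) and return its value."""
    if not 2 <= base <= len(HEX_KEY):
        raise ValueError('base must be between 2 and 16')
    total = 0
    for char in text.upper():
        digit = HEX_KEY.index(char)  # raises ValueError for non-digits
        if digit >= base:
            raise ValueError('digit %r is not valid in base %d' % (char, base))
        # Shift what has been read so far by one place, then add the digit.
        total = total * base + digit
    return total


def hex_to_dec(text):
    """Hexadecimal shortcut, analogous to rgx:hex-to-dec()."""
    return n_to_dec(text, 16)


if __name__ == "__main__":
    print(hex_to_dec('1F'))
```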
regex-ext-tan-functions
rgx:parse-flags($flags as xs:string) as element()
Input: a string corresponding to a $flags parameter in a regular expression function
Output: an element that distinguishes the parts of the string that are special TAN-regex flags from those that are not
Used by function rgx:replace(), rgx:matches(), rgx:tokenize(), rgx:analyze-string().
Relies upon rgx:best-unicode-version().
regex-ext-tan-functions
rgx:parse-regex($regex as xs:string?, $version as xs:double) as element()
Input: a regular expression; a double (Unicode version)
Output: an element with the regular expression parsed
Any errors are embedded as <error>s
Used by function rgx:regex().
Relies upon $close-group-symbols-regex, $escapes-in-regex, $open-group-symbols-regex, rgx:process-regex-escape-u().
Option 1 (regex-ext-tan-functions)
rgx:process-regex-escape-u($val-inside-braces as xs:string) as xs:string?
one-parameter version of fuller one, below
Used by function rgx:process-regex-escape-u(), rgx:parse-regex().
Relies upon rgx:process-regex-escape-u().
Option 2 (regex-ext-tan-functions)
rgx:process-regex-escape-u($val-inside-braces as xs:string, $version as xs:double) as xs:string?
Input: a string that is inside the braces of a \u{} expression
Output: the expansion of the expression
Acceptable contents of \u{}:
1. Individual hex values or ranges of them, separated by a comma or space. Values will be replaced with entities
'4d-4f, 51' -> 'M-OQ'
2. Composite signal: + followed by a string
'+b' -> 'bᵇḃḅḇ⒝ⓑ㍴㏔㏝b𝐛𝑏𝒃𝒷𝓫𝔟𝕓𝖇𝖻𝗯𝘣𝙗𝚋'
3. Base signal: - followed by a string
'-ḉ' -> 'c'
4. Name keywords: chains of . or ! each followed by a string
'.greek.capital.perispomeni' -> 'ἎἏἮἯἾἿὟὮὯᾎᾏᾞᾟᾮᾯ'
'.latin.cedilla' -> 'ÇçĢģĶķĻļŅņŖŗŞşŢţȨȩᷗḈḉḐḑḜḝḨḩ'
'.m!small' -> 'MƜൔᒻᒼᒾᒿᛗᛘᛙᣘᧄᮿᰮᴹḾṀṂℳⓂⱮㄇ㎛㎡㎥㎧㎨㏁㏞㏟ꚳꟽꟿꩌM'
Used by function rgx:process-regex-escape-u(), rgx:parse-regex().
Relies upon $base-marker-regex, $characters-allowed-in-ucd-names-regex, $composite-marker-regex, $name-marker-regex, $u-item-delimiter-regex, rgx:codepoints-to-string(), rgx:get-chars-by-name(), rgx:hex-to-dec(), rgx:string-base(), rgx:string-to-composites().
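The first kind of content (hex values and ranges) can be sketched in Python as follows. This covers only that one case and rests on assumptions: items are split on commas and whitespace, each hex value becomes the literal character, and characters that are special inside a character class are not escaped. expand_hex_items is a hypothetical name.

```python
import re


def expand_hex_items(value_inside_braces):
    """Turn '4d-4f, 51' into the character-class fragment 'M-OQ'."""
    fragments = []
    for item in re.split(r'[,\s]+', value_inside_braces.strip()):
        if not item:
            continue
        parts = item.split('-')
        if len(parts) > 2:
            raise ValueError('a range takes exactly two hex values: %r' % item)
        # A single value yields one character; a range keeps its hyphen.
        fragments.append('-'.join(chr(int(part, 16)) for part in parts))
    return ''.join(fragments)


if __name__ == "__main__":
    print(expand_hex_items('4d-4f, 51'))
```

The result is meant to sit inside square brackets in the final pattern, which is why the range hyphen is kept rather than expanded.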
Option 1 (regex-ext-tan-functions)
rgx:regex($regex as xs:string?) as xs:string?
one-parameter version of the longer one, below
Used by function rgx:regex(), rgx:replace(), rgx:matches(), rgx:tokenize(), rgx:analyze-string().
Relies upon rgx:regex().
Option 2 (regex-ext-tan-functions)
rgx:regex($regex as xs:string?, $version as xs:double) as xs:string?
Input: string representing a regex pattern
Output: the regular expression adjusted according to TAN-regex rules
Used by function rgx:regex(), rgx:replace(), rgx:matches(), rgx:tokenize(), rgx:analyze-string().
Relies upon rgx:parse-regex().
regex-ext-tan-functions
rgx:regex-is-valid($input-regex as xs:string?) as xs:boolean
Input: a string
Output: true if the string is a valid regular expression, false otherwise
No variables, keys, functions, or named templates depend upon this xsl:function.
Relies upon rgx:matches().
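A validity check of this kind is usually written by attempting to use the pattern and catching the failure. The Python sketch below compiles the pattern with the standard re module. Note the assumptions: Python's regex dialect differs from the XPath dialect the library targets, and treating a missing pattern as invalid is a guess. regex_is_valid is a hypothetical name.

```python
import re


def regex_is_valid(pattern):
    """True if `pattern` compiles as a Python regular expression."""
    if pattern is None:
        # Assumption: no pattern at all counts as invalid.
        return False
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


if __name__ == "__main__":
    print(regex_is_valid('a+'), regex_is_valid('a('))
```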
Option 1 (regex-ext-tan-functions)
rgx:replace($input as xs:string?, $pattern as xs:string, $replacement as xs:string) as xs:string
three-parameter version of the four-parameter function below
Used by function rgx:replace().
Relies upon rgx:replace().
Option 2 (regex-ext-tan-functions)
rgx:replace($input as xs:string?, $pattern as xs:string, $replacement as xs:string, $flags as xs:string) as xs:string
Parallel to fn:replace(), but converts \u{} into classes. See rgx:regex() for details.
Used by function rgx:replace().
Relies upon rgx:parse-flags(), rgx:regex().
Option 1 (regex-ext-tan-functions)
rgx:replace-by-char-name($string-to-replace as xs:string?, $words-in-name-to-drop as xs:string*, $words-in-replacement-char-name as xs:string*, $words-not-in-replacement-char-name as xs:string*, $search-is-strict as xs:boolean?) as xs:string?
five-parameter version of the full function, below
Used by function rgx:replace-by-char-name().
Relies upon rgx:replace-by-char-name().
Option 2 (regex-ext-tan-functions)
rgx:replace-by-char-name($string-to-replace as xs:string?, $words-in-name-to-drop as xs:string*, $words-in-replacement-char-name as xs:string*, $words-not-in-replacement-char-name as xs:string*, $search-is-strict as xs:boolean?, $version as xs:double) as xs:string?
six-parameter version of the function; the parameters are described under the two-parameter version, below
Used by function rgx:replace-by-char-name().
Relies upon rgx:build-char-replacement-guide(), rgx:replace-by-char-name().
Option 3 (regex-ext-tan-functions)
rgx:replace-by-char-name($string-to-replace as xs:string?, $replace-guide as element(rgx:replace)) as xs:string?
Input (five- and six-parameter versions): a string to be changed; three sets of strings; a boolean; optionally a double (Unicode version)
Input (two-parameter version): a string to be changed; a replacement guide built by rgx:build-char-replacement-guide()
Output: the string with characters replaced according to the rules below
This function was written primarily to transform Greek letters, e.g., to change graves into acutes.
The input string is broken into individual characters. Focus is placed on only those characters whose Unicode name has words matching $words-in-name-to-drop. Other words in the first matching name are retained, and a search is made for any other Unicode character whose name has the words specified by $words-in-replacement-char-name and does not have the words specified by $words-not-in-replacement-char-name.
If the boolean is false, then the search will return Unicode codepoints that might have other words in their name; otherwise the match must correspond to all words in the target name.
If the character does not have an entry in the $replace-guide, the original character is returned.
The process will be applied to a char against only the first name found, not aliases.
To use this function optimally, first build the replacement guide once and bind it to a global variable, using rgx:build-char-replacement-guide(), then pass that variable as the second parameter of the 2-parameter version of this function.
Used by function rgx:replace-by-char-name().
Does not rely upon global variables, keys, functions, or templates.
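The guide-based replacement described above can be approximated in Python with the character names in the standard unicodedata module. This is a sketch under stated assumptions, not the library's method: build_guide and replace_by_guide are hypothetical names, only strict matching is modeled (the target name must consist of exactly the remaining words plus the replacement words), $words-not-in-replacement-char-name and aliases are ignored, and by default only the Greek Extended block is scanned.

```python
import unicodedata


def build_guide(words_to_drop, words_to_add, first=0x1F00, last=0x1FFF):
    """Map each character whose name has all `words_to_drop` to the
    character whose name has those words swapped for `words_to_add`."""
    drop, add = set(words_to_drop), set(words_to_add)
    # Index the scanned characters by the set of words in their name.
    by_words = {}
    for codepoint in range(first, last + 1):
        name = unicodedata.name(chr(codepoint), '')
        if name:
            by_words.setdefault(frozenset(name.split()), chr(codepoint))
    guide = {}
    for words, char in by_words.items():
        if drop <= words:
            target = frozenset((words - drop) | add)
            if target in by_words:
                guide[char] = by_words[target]
    return guide


def replace_by_guide(text, guide):
    """Replace each character found in `guide`; keep all others as-is."""
    return ''.join(guide.get(char, char) for char in text)


if __name__ == "__main__":
    # Build the guide once, then reuse it for every string.
    grave_to_acute = build_guide(['VARIA'], ['OXIA'])
    print(replace_by_guide('\u03c4\u1f78', grave_to_acute))
```

Building the guide once and reusing it mirrors the advice above about binding the guide to a global variable: the expensive name search is done a single time rather than per call.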
Option 1 (regex-ext-tan-functions)
rgx:string-base($arg as xs:string?) as xs:string?
one-param version of the fuller one, below
Used by function rgx:string-base(), rgx:process-regex-escape-u().
Relies upon rgx:string-base().
Option 2 (regex-ext-tan-functions)
rgx:string-base($arg as xs:string?, $version as xs:double) as xs:string?
This function takes any string and replaces every character with its base Unicode character. This function is useful to prepare a text to be searched without respect to accents. E.g., ἄνθρωπός -> ανθρωπος
Note, the ς is retained because it doesn't decompose. To match on σ one needs to use the flag 'i' (case insensitive) because ς case-folds to σ.
This function is similar to rgx:string-to-components(), but strictly enforces a one-for-one replacement, so that it behaves much like fn:lower-case() and fn:upper-case(), where the string length is always preserved. To this end, this function is based on fn:translate(), and uses simple decomposition databases, which are much smaller and quicker to use than are full decomposition databases.
The strict one-for-one replacement observes the following rules:
If a character decomposes to a single character, that single character is returned.
If a character decomposes to multiple characters that are identical, that single character is returned, e.g., ‴ to ′
If a character decomposes to multiple characters, a distinction is made between base and non-base characters:
- Base characters: \p{Lu}\p{Ll}\p{Lt}\p{Lo}\p{N}\p{S}
- Non-base characters: \p{Lm}\p{M}\p{P}\p{Z}\p{C}
If after non-base characters are removed there is not exactly one unique decomposed character left, the original input is retained.
The above rules are already reflected in the contents of the simple decomposition database, so do not need to be expressed in this function. For more, see ucd/ucd-decomp.xsl.
Used by function rgx:string-base(), rgx:process-regex-escape-u().
Relies upon rgx:get-ucd-decomp-simple-db().
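The one-for-one rules described above can be approximated in Python with the standard unicodedata module. This is a sketch under assumptions: it uses NFKD as the decomposition and whichever Unicode version ships with the interpreter, whereas the library reads its own version-specific simple decomposition database. The function names are hypothetical.

```python
import unicodedata


def is_base(char):
    """Base characters: Lu, Ll, Lt, Lo, N*, S*. Non-base: Lm, M*, P*, Z*, C*."""
    category = unicodedata.category(char)
    return category != 'Lm' and category[0] not in 'MPZC'


def char_base(char):
    """Return the single base character of `char`, or `char` itself."""
    parts = unicodedata.normalize('NFKD', char)
    if parts == char:
        return char  # does not decompose
    if len(set(parts)) == 1:
        return parts[0]  # one character, or several identical ones
    bases = {part for part in parts if is_base(part)}
    # Keep the original unless exactly one unique base character remains.
    return bases.pop() if len(bases) == 1 else char


def string_base(text):
    """Replace every character with its base, preserving string length."""
    return ''.join(char_base(char) for char in text)


if __name__ == "__main__":
    print(string_base('\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03cc\u03c2'))
```

Because each input character maps to exactly one output character, the result always has the same number of characters as the input.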
Option 1 (regex-ext-tan-functions)
rgx:string-to-components($arg as xs:string?) as xs:string*
one-param version of the fuller one, below
Used by function rgx:string-to-components().
Relies upon rgx:string-to-components().
Option 2 (regex-ext-tan-functions)
rgx:string-to-components($arg as xs:string?, $version as xs:double) as xs:string*
Input: any string; a Unicode version number.
Output: one string per character in the input; if a character lends itself to decomposition, its component parts are returned, otherwise the character itself is returned.
This function is the inverse of rgx:string-to-composites().
If you wish to have more control over which components are returned (e.g., exclusion of combining marks), consider using either rgx:string-base() or the database directly: rgx:get-ucd-decomp-db(). There, each rgx:char/rgx:b has @gc with the code for the component's general category.
Used by function rgx:string-to-components().
Relies upon $default-ucd-decomp-db, rgx:get-ucd-decomp-db().
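The one-string-per-character shape of the output can be illustrated in Python. As an assumption, the sketch uses NFKD normalization from the standard unicodedata module as the decomposition; the library's own database may decompose differently, and string_to_components is a hypothetical name.

```python
import unicodedata


def string_to_components(text):
    """Return one string per input character: its decomposition if it has
    one, otherwise the character itself."""
    return [unicodedata.normalize('NFKD', char) for char in text]


if __name__ == "__main__":
    print(string_to_components('\u00e9a'))
```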
Option 1 (regex-ext-tan-functions)
rgx:string-to-composites($arg as xs:string?) as xs:string*
one-parameter version of fuller one, below
Used by function rgx:string-to-composites(), rgx:process-regex-escape-u().
Relies upon rgx:string-to-composites().
Option 2 (regex-ext-tan-functions)
rgx:string-to-composites($arg as xs:string?, $version as xs:double) as xs:string*
Input: a string; a version of Unicode (double)
Output: one string per character in the input; that string consists of the character itself followed by all characters that use it as a base
This function is the inverse of rgx:string-to-components(). E.g., 'Max' -> 'MᴹḾṀṂℳⅯⓂ㎆㎒㎫㎹㎿㏁M𝐌𝑀𝑴𝓜𝔐𝕄𝕸𝖬𝗠𝘔𝙈𝙼🄼🅋🅪🅫aªàáâãäåāăąǎǟǡǻȁȃȧᵃḁẚạảấầẩẫậắằẳẵặₐ℀℁ⓐ㏂a𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊xˣẋẍₓⅹⅺⅻⓧx𝐱𝑥𝒙𝓍𝔁𝔵𝕩𝖝𝗑𝘅𝘹𝙭𝚡'
This is useful for preparing regex character classes to broaden a search.
Used by function rgx:string-to-composites(), rgx:process-regex-escape-u().
Relies upon $default-ucd-decomp-db, rgx:get-ucd-decomp-db().
Option 1 (regex-ext-tan-functions)
rgx:tokenize($input as xs:string?, $pattern as xs:string) as xs:string*
two-parameter version of the three-parameter function below
Used by function rgx:tokenize().
Relies upon rgx:tokenize().
Option 2 (regex-ext-tan-functions)
rgx:tokenize($input as xs:string?, $pattern as xs:string, $flags as xs:string) as xs:string*
Parallel to fn:tokenize(), but converts \u{} into classes. See rgx:regex() for details.
Used by function rgx:tokenize().
Relies upon rgx:parse-flags(), rgx:regex().