<token-definition>
Definitive list of key terms used to name standard token definitions.
Master location: http://textalign.net/release/TAN-2021/vocabularies/token-definitions.TAN-voc.xml
Table 11.14. TAN keywords for types of token definitions
Names | Pattern | Comments |
---|---|---|
letters; letters only; general word characters only; general ignore punctuation; gwo | [\w]+ | General tokenization pattern for any language, words only. Non-letters such as punctuation are ignored. |
letters and hyphens | [\w-]+ | General tokenization pattern for any language, only word characters (as defined in Unicode) and the hyphen. All other characters are ignored. |
letters and apostrophes | [\w'’]+ | General tokenization pattern for any language, only word characters (as defined in Unicode) and the apostrophe variants ' and ’. All other characters are ignored. Note, this pattern will produce misleading results for texts that use single quotation marks. |
letters hyphens and apostrophes; letters apostrophes and hyphens; letters, hyphens and apostrophes; letters, apostrophes and hyphens; letters, hyphens, and apostrophes; letters, apostrophes, and hyphens | [\w'’-]+ | General tokenization pattern for any language, only word characters (as defined in Unicode), the hyphen, and the apostrophe variants ' and ’. All other characters are ignored. Note, this pattern will produce misleading results for texts that use single quotation marks. |
letters and punctuation; general non space characters; general include punctuation | [\w]+\|[^\w\s] | General tokenization pattern for any language, treating not only series of letters as word tokens but also individual non-letter characters (e.g., punctuation). |
nonspace | \S+ | General tokenization pattern for any language, treating any contiguous run of non-space characters as a word. |
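
To compare how the patterns in the table behave, the following minimal sketch applies each one to the same sample string. It is an illustration only, not part of TAN: the `TOKEN_PATTERNS` dictionary, the `tokenize` helper, and the sample sentence are hypothetical names chosen for this example, and each dictionary key is simply the first name listed in the corresponding row. The sketch uses Python's `re` module, whose `\w` is not guaranteed to classify every character the same way as the regular expression flavor a TAN processor uses (the underscore and some symbols are likely to differ), so results on unusual characters should be treated as approximate.

```python
import re

# Patterns copied from the table; keys are the first name given in each row.
# Assumption: Python's Unicode-aware \w stands in for the processor's \w.
TOKEN_PATTERNS = {
    "letters": r"[\w]+",
    "letters and hyphens": r"[\w-]+",
    "letters and apostrophes": r"[\w'’]+",
    "letters hyphens and apostrophes": r"[\w'’-]+",
    "letters and punctuation": r"[\w]+|[^\w\s]",
    "nonspace": r"\S+",
}


def tokenize(text: str, name: str) -> list[str]:
    """Return every non-overlapping match of the named pattern, in order."""
    return re.findall(TOKEN_PATTERNS[name], text)


# Hypothetical sample containing an apostrophe, a hyphen, and single quotes.
sample = "Don't over-think it, 'friend'."

for name in TOKEN_PATTERNS:
    print(f"{name}: {tokenize(sample, name)}")
```

On this sample, the "letters and apostrophes" pattern yields `'friend'` with both quotation marks attached, which is the misleading result the table warns about for texts that use single quotation marks.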