<token-definition>
Definitive list of key terms used to name standard token definitions.
Master location: http://textalign.net/release/TAN-2020/vocabularies/token-definitions.TAN-voc.xml
Table 10.14. TAN keywords for types of token definitions
| vocabularies (optional values of @which) | pattern | Comments |
|---|---|---|
| | | General tokenization pattern for any language, words only. Non-letters such as punctuation are ignored. |
| | | General tokenization pattern for any language, only word characters (as defined in Unicode) and the hyphen. All other characters are ignored. |
| | | General tokenization pattern for any language, only word characters (as defined in Unicode) and the apostrophe variants ' and ’. All other characters are ignored. Note, this pattern will produce misleading results for texts that use single quotation marks. |
| | | General tokenization pattern for any language, only word characters (as defined in Unicode), the hyphen, and the apostrophe variants ' and ’. All other characters are ignored. Note, this pattern will produce misleading results for texts that use single quotation marks. |
| | | General tokenization pattern for any language, treating not only series of letters as word tokens but also individual non-letter characters (e.g., punctuation). |
| | | General tokenization pattern for any language, treating any contiguous run of nonspace marks as a word. |
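
The six behaviors described in the Comments column can be illustrated with a minimal Python sketch. The regular expressions and the dictionary labels below are assumptions made for illustration only: they are not the TAN keyword names or the official TAN patterns, and Python's `\w` (which also matches digits and the underscore) may not coincide with the word-character class of the regular-expression flavor TAN uses. The `tokenize` helper and the sample sentence are likewise hypothetical.

```python
import re

# Hypothetical approximations of the six behaviors described in the table.
# These are NOT the official TAN patterns or keyword names.
PATTERNS = {
    "words only": r"\w+",
    "words and hyphen": r"[\w\-]+",
    "words and apostrophes": r"[\w'’]+",
    "words, hyphen, and apostrophes": r"[\w\-'’]+",
    "words and single non-letters": r"\w+|[^\w\s]",
    "nonspace runs": r"\S+",
}


def tokenize(text, pattern):
    """Return every non-overlapping match of pattern in text, in order."""
    return re.findall(pattern, text)


sample = "Don't re-read it, 'please'."

for label, pattern in PATTERNS.items():
    print(f"{label}: {tokenize(sample, pattern)}")
```

With this sample, the apostrophe-aware patterns return `'please'` with its quotation marks attached, which is the misleading result the table warns about for texts that use single quotation marks.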