TAN keywords for types of token definitions (<token-definition>)

Definitive list of key terms used to name standard token definitions.

Master location: http://textalign.net/release/TAN-1-dev/TAN-key/tokenizations.TAN-key.xml

Table 9.11. TAN keywords for types of token definitions

keywords (optional values of @which) | regex | Comments
  • letters

  • letters only

  • general-words-only-1

  • general-words-only

  • gwo

  • [\w­​‍]+

General tokenization pattern for any language, words only. Non-letters such as punctuation are ignored.

  • letters and punctuation

  • general-1

  • general

  • gen

  • \w+|[^\w\s]

General tokenization pattern for any language. Each run of letters is a word token, and each individual non-letter, non-space character (e.g., a punctuation mark) is also a token of its own.

  • nonspace

  • \S+

General tokenization pattern for any language, treating any contiguous run of non-space characters as a word token.
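
The difference between the three patterns is easiest to see by running them on the same string. The following is a minimal sketch in Python, not part of TAN itself. The names PATTERNS and tokenize are hypothetical, and the escapes in the first pattern (soft hyphen, zero-width space, zero-width joiner) are an assumption about the three invisible characters in that regex. Python's \w is also not identical to \w in the XPath/XSD regular expressions that TAN relies on, so results may differ on some characters.

```python
import re

# Patterns transcribed from the table above, keyed by the first keyword of
# each row. ASSUMPTION: the three invisible characters in the first pattern
# are U+00AD (soft hyphen), U+200B (zero-width space), U+200D (zero-width joiner).
PATTERNS = {
    "letters": r"[\w\u00ad\u200b\u200d]+",
    "letters and punctuation": r"\w+|[^\w\s]",
    "nonspace": r"\S+",
}


def tokenize(text, which):
    """Hypothetical helper: return every match of the named pattern in text."""
    return re.findall(PATTERNS[which], text)


if __name__ == "__main__":
    sample = "Hello, world!"
    for which in PATTERNS:
        # Print each keyword with the tokens it yields for the sample.
        print(which, tokenize(sample, which))
```

For the sample "Hello, world!", the first pattern yields Hello and world; the second yields Hello, a comma, world, and an exclamation mark as four separate tokens; the third yields "Hello," and "world!" with the punctuation attached.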