Defining words and tokens

At the heart of interaction between class-1 and class-2 files is the need to identify words. This poses a problem at the outset. The term word is notoriously difficult to define, no matter the context or language. For example, "New York" and "didn't" can each be reasonably defined as being either one or two words. Furthermore, some scholars consider punctuation to be words (e.g., commas in modern prose, representing "and"), whereas others ignore them as being anachronistic or capricious (e.g., medieval manuscripts or modern editions of ancient texts). In the end, the many meanings of "word" reflect the diversity of scholarship.

TAN follows the field of corpus linguistics and avoids word in favor of the proximate term token—one or more characters defined not according to grammar but according to a regular expression (see the section called “Regular expressions”).

In TAN, a token is purely a string definition, used to segment and to point. A token in TAN does not entail any linguistic categories. Neither editors nor users of TAN data should infer that a <tok> points to a morpheme, a lexeme, or any other linguistic entity. There will frequently be a fortuitous correlation between the two, but it is not guaranteed.

TAN was developed with a concern for ancient literature, where punctuation is generally ignored as being late or not central to the text. Happily, even in contemporary use, most people ignore punctuation when they count words. Therefore the default <token-definition> defines a token as being any continuous string of word characters (\w), the soft hyphen, the zero-width space, or the zero-width joiner, formally defined by the section called “$tan:token-definition-default”:

<token-definition regex="[\w&#xad;&#x200b;&#x200d;]+"/>

This pattern closely resembles what is ordinarily thought of as words, but perhaps with some surprises (see above, the section called “Regular expressions”). If no <token-definition> is explicitly given, the default token definition above will be used.
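To see roughly what the default definition selects, the pattern can be tried outside of TAN. The following is a minimal sketch in Python, not part of TAN itself; the names DEFAULT_TOKEN and tokenize_default are invented for illustration. It assumes that Python's \w is close enough to the \w of XML regular expressions for simple Latin-script text, although the two are not identical (Python's \w, for instance, is limited to alphanumeric characters and the underscore).

```python
import re

# Approximation of the default TAN token definition. TAN evaluates the
# pattern as an XML/XPath regular expression; Python's \w is similar but
# not identical, so treat the results as indicative only.
DEFAULT_TOKEN = re.compile(r"[\w\u00ad\u200b\u200d]+")


def tokenize_default(text):
    """Return every substring of text that matches the default pattern."""
    return DEFAULT_TOKEN.findall(text)


print(tokenize_default("New York"))        # ['New', 'York']
print(tokenize_default("didn't"))          # ['didn', 't']
print(tokenize_default("(I go!)"))         # ['I', 'go']
# The soft hyphen (U+00AD) is part of the token, not a boundary.
print(len(tokenize_default("co\u00adoperate")))  # 1
```

Under this approximation the apostrophe in "didn't" is not a word character, so the string yields two tokens, one of the surprises the default pattern can produce.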

If you are working with modern texts, where punctuation might be important to name and number, try the built-in keyword letters and punctuation:

<token-definition regex="[\w&#xad;&#x200b;&#x200d;]+|[^\w&#xad;&#x200b;&#x200d;\s]"/>

This expression defines a token as either a sequence of word characters or any single character that is neither a word character nor whitespace. The string (I go!) would have five tokens: (, I, go, !, and ).
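The five-token result can be checked with a short Python sketch. As before, this only approximates TAN's behavior, because Python's \w and \s are not identical to their XML regular expression counterparts; the names LETTERS_AND_PUNCTUATION and tokenize_letters_and_punctuation are hypothetical.

```python
import re

# Approximation of the "letters and punctuation" token definition:
# either a run of word characters (plus soft hyphen, zero-width space,
# zero-width joiner), or one character that is neither of those nor
# whitespace.
LETTERS_AND_PUNCTUATION = re.compile(
    r"[\w\u00ad\u200b\u200d]+|[^\w\u00ad\u200b\u200d\s]"
)


def tokenize_letters_and_punctuation(text):
    """Return word tokens and single-character punctuation tokens."""
    return LETTERS_AND_PUNCTUATION.findall(text)


print(tokenize_letters_and_punctuation("(I go!)"))
# ['(', 'I', 'go', '!', ')']
print(tokenize_letters_and_punctuation("didn't"))
# ['didn', "'", 't']
```

Each punctuation mark becomes its own token, so a run such as "?!" would yield two tokens rather than one.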

For other standard TAN token definitions, see the section called “TAN keywords for types of token definitions (<token-definition>)”. You may also write your own <token-definition>, but keep in mind that TAN files are meant to be shared across fields and disciplines, so you should define tokens in a way that users of your texts will expect. Two class-2 TAN annotation files that use different tokenization systems can be challenging to collate.
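One reason collation is difficult is that the same position number can land on different strings under different definitions. The sketch below illustrates this with the two patterns discussed in this section, approximated in Python; the helper numbered_tokens is hypothetical, and counting positions from 1 is an assumption made for the illustration.

```python
import re

# Python approximations of the two token definitions from this section.
DEFAULT = re.compile(r"[\w\u00ad\u200b\u200d]+")
LETTERS_AND_PUNCTUATION = re.compile(
    r"[\w\u00ad\u200b\u200d]+|[^\w\u00ad\u200b\u200d\s]"
)


def numbered_tokens(pattern, text):
    """Return (position, token) pairs, counting positions from 1."""
    return list(enumerate(pattern.findall(text), start=1))


text = "(I go!)"
by_default = numbered_tokens(DEFAULT, text)
by_letters_and_punctuation = numbered_tokens(LETTERS_AND_PUNCTUATION, text)

print(by_default)
# [(1, 'I'), (2, 'go')]
print(by_letters_and_punctuation)
# [(1, '('), (2, 'I'), (3, 'go'), (4, '!'), (5, ')')]
```

Position 2 is go under the first definition and I under the second, so references made by position in one file cannot be carried over to the other without remapping.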
