Defining Words and Tokens


At the heart of interaction between class-1 and class-2 files is the need to identify words. This poses a problem at the outset. The term word is notoriously difficult to define, no matter the context or language. For example, "New York" and "didn't" can each be justifiably claimed to be one or two words. Furthermore, some scholars consider punctuation to be words (e.g., commas in modern prose, representing "and"), whereas others ignore them as being anachronistic or capricious (e.g., commas inserted by a medieval scribe or a modern scholar into ancient Greek and Latin texts). In the end, the number of meanings for "word" reflects the diversity of scholarship.

TAN follows the field of corpus linguistics and avoids word in favor of the proximate term token—one or more characters defined not according to grammar but according to a regular expression (see the section called “Regular Expressions”).

In TAN, a token is purely a string definition, used as a segmenting and pointing mechanism. To define a token in TAN does not entail any linguistic commitments. Neither editors nor users of TAN data should infer that a <tok> points to a morpheme, a lexeme, or any other linguistic entity. A token will frequently coincide with such an entity, but the correlation is fortuitous and not guaranteed.

TAN was developed with a concern for ancient literature, where punctuation is generally ignored as being late or not central to the text. Happily, even in contemporary use, most people ignore punctuation when they count words. Therefore the default <token-definition> defines a token as any unbroken sequence of word characters (\w), soft hyphens, zero-width spaces, or zero-width joiners, as formally defined in the section called “$token-definition-default”:

<token-definition regex="[\w&#xad;&#x200b;&#x200d;]+"/>

The strings matched by this pattern closely resemble what are ordinarily thought of as words, but perhaps with some surprises (see above, the section called “Regular Expressions”). If no <token-definition> is explicitly given, the default token definition above will be used.
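To get a feel for what the default definition captures, the following is a minimal sketch in Python, not part of TAN itself. The function name tokenize_default is hypothetical, and Python's \w is only an approximation of the \w used in XML regular expressions (Python's \w matches the underscore, for instance), so results may differ at the edges.

```python
import re

# Approximation of the default TAN token definition. The three extra
# characters are the soft hyphen (U+00AD), the zero-width space (U+200B),
# and the zero-width joiner (U+200D). Python's \w is an assumption here:
# it is not guaranteed to match exactly what \w matches in XML regex.
DEFAULT_TOKEN = re.compile("[\\w\u00ad\u200b\u200d]+")

def tokenize_default(text):
    """Return the tokens of `text` under the approximated default definition."""
    return DEFAULT_TOKEN.findall(text)

print(tokenize_default("(I go!)"))   # ['I', 'go']
print(tokenize_default("didn't"))    # ['didn', 't']
# A soft hyphen inside a word does not split the token.
print(len(tokenize_default("co\u00adoperate")))  # 1
```

Note that "didn't" yields two tokens, because the apostrophe is not a word character. This is one of the surprises to expect.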

If you are working with modern texts, where punctuation might be important to name and number, try the built-in keyword “letters and punctuation”:

<token-definition regex="[\w&#xad;&#x200b;&#x200d;]+|[^\w&#xad;&#x200b;&#x200d;\s]"/>

This expression defines a token as either a sequence of word characters or any single character that is neither a word character nor white space. The string (I go!) would have five tokens: ( I go ! ).
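The five-token count can be checked with a short Python sketch. This is an illustration only: the function name tokenize_letters_and_punctuation is hypothetical, and Python's \w and \s are assumed to be close enough to their XML regular expression counterparts for this example.

```python
import re

# Approximation of the "letters and punctuation" definition: either a run
# of word characters (plus soft hyphen, zero-width space, zero-width
# joiner), or one single character that is neither of those nor white space.
LETTERS_AND_PUNCTUATION = re.compile(
    "[\\w\u00ad\u200b\u200d]+|[^\\w\u00ad\u200b\u200d\\s]"
)

def tokenize_letters_and_punctuation(text):
    """Return the tokens of `text`; white space is never part of a token."""
    return LETTERS_AND_PUNCTUATION.findall(text)

print(tokenize_letters_and_punctuation("(I go!)"))
# ['(', 'I', 'go', '!', ')']
```

Each punctuation mark becomes its own token, so a run such as "!!" would count as two tokens rather than one.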

For other standard TAN token definitions, see the section called “TAN keywords for types of token definitions (<token-definition>)”. You may customize your own <token-definition>. But keep in mind that TAN files were meant to be shared across fields and disciplines. You should define tokens in a way users of your texts expect. Specialized definitions make it difficult to compare the data in your TAN file with that in others. Two class-2 files annotating the same class-1 file cannot be easily compared or synthesized if they use different token definitions.
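The difficulty of comparing files that use different token definitions can be seen in a small Python sketch. It is a simplified illustration, not TAN's actual pointing mechanism: the helper nth_token is hypothetical, and Python's \w and \s only approximate the XML regular expression classes.

```python
import re

# Shared character class: word characters plus soft hyphen,
# zero-width space, and zero-width joiner.
WORD = "\\w\u00ad\u200b\u200d"
DEFAULT = re.compile(f"[{WORD}]+")
LETTERS_AND_PUNCTUATION = re.compile(f"[{WORD}]+|[^{WORD}\\s]")

def nth_token(text, pattern, n):
    """Return the n-th token (1-based) of `text`, or None if there is none."""
    tokens = pattern.findall(text)
    return tokens[n - 1] if 1 <= n <= len(tokens) else None

text = "Well, I go!"
# The same ordinal position picks out different strings
# depending on the token definition in force.
print(nth_token(text, DEFAULT, 2))                  # I
print(nth_token(text, LETTERS_AND_PUNCTUATION, 2))  # ,
```

Here "the second token" means "I" under one definition and the comma under the other, so two annotations that both refer to token 2 would not be talking about the same string.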
