At the heart of interaction between class-1 and class-2 files is the need to identify words. This poses a problem at the outset. The term word is notoriously difficult to define, no matter the context or language. For example, "New York" and "didn't" can each be justifiably claimed to be one or two words. Furthermore, some scholars consider punctuation to be words (e.g., commas in modern prose, representing "and"), whereas others ignore them as being anachronistic or capricious (e.g., commas inserted by a medieval scribe or a modern scholar into ancient Greek and Latin texts). In the end, the number of meanings for "word" reflects the diversity of scholarship.
TAN follows the field of corpus linguistics and avoids word in favor of the proximate term token—one or more characters defined not according to grammar but according to a regular expression (see the section called “Regular Expressions”).
In TAN, a token is purely a string definition, used as a segmenting and pointing
mechanism. To define a token in TAN does not entail any linguistic commitments.
Neither editors nor users of TAN data should infer that a <tok> points to a morpheme,
a lexeme, or any other linguistic entity. There will frequently be a fortuitous
correlation between the two, but it is not guaranteed.
TAN was developed with a concern for ancient literature, where punctuation is
generally ignored as being late or not central to the text. Happily, even in
contemporary use, most people ignore punctuation when they count words. Therefore the
default <token-definition> defines a token as any continuous string of word
characters (\w), soft hyphens, zero-width spaces, or zero-width joiners, formally
defined in the section called “$token-definition-default”:
<token-definition regex="[\w­​‍]+"/>
This pattern closely resembles what is ordinarily thought of as words, but perhaps
with some surprises (see above, the section called “Regular Expressions”). If no
<token-definition> is explicitly given, the default token definition above will be
used.
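
The behavior of the default definition can be approximated outside TAN with a short
Python sketch. It is illustrative only: the names DEFAULT_TOKEN and tokenize are
hypothetical, and Python's \w is not guaranteed to match the same characters as the
regular-expression flavor TAN itself uses (Python, for instance, counts the underscore
as a word character). The soft hyphen, zero-width space, and zero-width joiner are
written as the escapes \u00AD, \u200B, and \u200D so that they are visible.

```python
import re

# Approximation of the default token definition: one or more word characters,
# soft hyphens (U+00AD), zero-width spaces (U+200B), or zero-width joiners (U+200D).
# Assumption: Python's \w stands in for the \w of TAN's own regex flavor.
DEFAULT_TOKEN = re.compile(r"[\w\u00AD\u200B\u200D]+")


def tokenize(text):
    """Return every substring of text that matches the default pattern."""
    return DEFAULT_TOKEN.findall(text)


# Punctuation and spaces are skipped; only runs of word characters survive.
print(tokenize("Sing, goddess, the wrath of Achilles!"))
# ['Sing', 'goddess', 'the', 'wrath', 'of', 'Achilles']

# One of the surprises: the apostrophe is not a word character.
print(tokenize("didn't"))
# ['didn', 't']
```

The second call shows why a token should not be equated with a word: under this
pattern "didn't" yields two tokens, whatever one's linguistic view of it.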
If you are working with modern texts, where punctuation might be important to name
and number, try the built-in keyword letters and punctuation:
<token-definition regex="[\w­​‍]+|[^\w­​‍\s]"/>
This expression defines a token as either a sequence of word characters or any single
character that is neither a word character nor white space. The string (I go!) would
have five tokens: ( I go ! ).
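
The five-token result can be checked with a minimal Python sketch. The name
LETTERS_AND_PUNCTUATION is hypothetical, and Python's \w and \s only approximate the
character classes of TAN's own regular-expression flavor, so treat this as an
illustration rather than a reference implementation.

```python
import re

# Approximation of the "letters and punctuation" definition: either a run of word
# characters (plus U+00AD, U+200B, U+200D), or one character that is neither a
# word character nor white space.
LETTERS_AND_PUNCTUATION = re.compile(
    r"[\w\u00AD\u200B\u200D]+|[^\w\u00AD\u200B\u200D\s]"
)

tokens = LETTERS_AND_PUNCTUATION.findall("(I go!)")
print(tokens)       # ['(', 'I', 'go', '!', ')']
print(len(tokens))  # 5
```

Because the second alternative matches exactly one character, a run such as "?!"
would produce two tokens, not one.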
For other standard TAN token definitions, see the section called “TAN keywords for
types of token definitions (<token-definition>)”. You may customize your own
<token-definition>. But keep in mind that TAN files
were meant to be shared across fields and disciplines. You should define tokens in a
way users of your texts expect. Specialized definitions make it difficult to compare
the data in your TAN file with that in others. Two class-2 files annotating the same
class-1 file cannot be easily compared or synthesized if they use different token
definitions.
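
The comparison problem can be made concrete with a small Python sketch. Assume, purely
for illustration, that two annotators point to tokens by their position in the same
text, one under the default definition and one under letters and punctuation. The
helper token_at and both pattern names are hypothetical, and Python's \w and \s only
approximate TAN's regular-expression flavor.

```python
import re

# Two token definitions applied to the same text (approximated in Python).
DEFAULT = re.compile(r"[\w\u00AD\u200B\u200D]+")
WITH_PUNCTUATION = re.compile(r"[\w\u00AD\u200B\u200D]+|[^\w\u00AD\u200B\u200D\s]")


def token_at(pattern, text, position):
    """Return the token at a 1-based position under the given definition."""
    return pattern.findall(text)[position - 1]


text = "(I go!)"

# The same pointer, "token 2", picks out different strings.
print(token_at(DEFAULT, text, 2))           # go
print(token_at(WITH_PUNCTUATION, text, 2))  # I
```

A pointer that is valid under one definition therefore cannot be reused under another
without first being recalculated.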