At the heart of interaction between class-1 and class-2 files is the need to identify words. This poses a problem at the outset. The term word is notoriously difficult to define, no matter the context or language. For example, "New York" and "didn't" can each be reasonably defined as being either one or two words. Furthermore, some scholars consider punctuation marks to be words (e.g., commas in modern prose, representing "and"), whereas others ignore them as being anachronistic or capricious (e.g., in medieval manuscripts or modern editions of ancient texts). In the end, the many meanings of "word" reflect the diversity of scholarship.
TAN follows the field of corpus linguistics and avoids word in favor of the proximate term token—one or more characters defined not according to grammar but according to a regular expression (see the section called “Regular expressions”).
In TAN, a token is purely a string definition, used to segment and to point. A
token in TAN does not entail any linguistic categories. Neither editors nor users of
TAN data should infer that a <tok> points to a morpheme, a lexeme, or any other
linguistic entity. Tokens and linguistic units will frequently coincide, but that
correlation is fortuitous and not guaranteed.
TAN was developed with a concern for ancient literature, where punctuation is
generally ignored as being late or not central to the text. Happily, even in
contemporary use, most people ignore punctuation when they count words. Therefore the
default <token-definition> defines a token as being any continuous string of word
characters (\w), the soft hyphen, the zero-width space, or the zero-width joiner,
formally defined by the section called “$tan:token-definition-default”:
<token-definition regex="[\w­​‍]+"/>
This pattern closely resembles what is ordinarily thought of as words, but perhaps
with some surprises (see above, the section called “Regular expressions”). If no
<token-definition> is explicitly given, the default token definition above will be
used.
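As a rough illustration of how the default definition segments a string, here is a minimal Python sketch. It is not part of TAN, and the names DEFAULT_TOKEN and tokenize are hypothetical. Note the assumption it rests on: Python's \w is narrower than the \w of the XML Schema and XPath regular expressions that TAN relies on (which also admits symbols, for example), so the output only approximates what a TAN processor would produce.

```python
import re

# Approximation of the default TAN token definition: one or more word
# characters, soft hyphens (U+00AD), zero-width spaces (U+200B), or
# zero-width joiners (U+200D).
# Assumption: Python's \w stands in for the XPath/XML Schema \w, which
# matches a broader set of characters.
DEFAULT_TOKEN = re.compile(r"[\w\u00ad\u200b\u200d]+")


def tokenize(text: str) -> list[str]:
    """Return every non-overlapping match of the token pattern, in order."""
    return DEFAULT_TOKEN.findall(text)


print(tokenize("New York didn't sleep."))
# ['New', 'York', 'didn', 't', 'sleep']
```

Under this definition "New York" yields two tokens and "didn't" also yields two, because neither the space nor the apostrophe is a word character; the final period is dropped altogether.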
If you are working with modern texts, where punctuation marks might be important
enough to name and number, try the built-in keyword letters and punctuation:
<token-definition regex="[\w­​‍]+|[^\w­​‍\s]"/>
This expression defines a token as either a sequence of word characters or any
single character that is neither a word character nor white space. The string
(I go!) would have five tokens: (, I, go, !, and ).
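The five-token result can be checked with a short Python sketch. As before, this is only an approximation under stated assumptions: the names LETTERS_AND_PUNCTUATION and tokenize are hypothetical, and Python's \w and \s are not identical to their counterparts in the XPath regular expressions that TAN uses, so edge cases (symbols in particular) may be classified differently.

```python
import re

# Approximation of the "letters and punctuation" token definition:
# either a run of word characters (plus soft hyphen, zero-width space,
# zero-width joiner), or one single character that is neither a word
# character nor white space.
# Assumption: Python's \w and \s stand in for the XPath equivalents.
LETTERS_AND_PUNCTUATION = re.compile(
    r"[\w\u00ad\u200b\u200d]+|[^\w\u00ad\u200b\u200d\s]"
)


def tokenize(text: str) -> list[str]:
    """Return every non-overlapping match of the token pattern, in order."""
    return LETTERS_AND_PUNCTUATION.findall(text)


print(tokenize("(I go!)"))
# ['(', 'I', 'go', '!', ')']
```

Because the second alternative matches exactly one character, a run such as "?!" becomes two separate tokens rather than one.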
For other standard TAN token definitions, see the section called “TAN keywords for
types of token definitions (<token-definition>)”. You may customize your own
<token-definition>. But keep in mind that TAN files
were meant to be shared across fields and disciplines. You should define tokens in a
way users of your texts expect. Two class-2 TAN annotation files with different
tokenization systems can be challenging to collate.