This chapter provides general background to class-1 TAN files and their elements and attributes. For detailed discussion see Chapter 9, TAN patterns, elements, and attributes defined.
Class-1 TAN files preserve segmented transcriptions of books, manuscripts, papyri, stones, or any other objects with writing on them, collectively termed here scripta (sg. scriptum). Class-1 files are the foundation of any TAN project. No TAN-A-tok or TAN-A-lm file can be created without at least one class-1 file.
There are two types of class-1 formats, identified by the root element. <TAN-T> is a simple, generic format, as close as one can get to plain text. <TEI> (also referred to in this manual as TAN-T(EI)), on the other hand, can be complex and highly expressive. Because the two formats function almost identically, the generic TAN-T format is described first, followed by supplemental comments on TAN-TEI.
(For more general principles and assumptions applying to all TAN files, not just class 1, see the section called “Design Principles”.)
Class-1 formats are designed for faithful but judiciously normalized digital transcriptions. Each TAN-T(EI) file is devoted exclusively to a single version of a single work found in a single scriptum (text-bearing object), segmented and uniquely labeled with a (preferably familiar) reference system.
Editors of TAN-T(EI) files should be able to read, write, and proofread texts in the languages of the transcriptions. They should understand the texts well enough to segment them and label them according to the conventions used for those works. They should be able to distinguish the text of a primary source from its editorial apparatus. They should be familiar with normalizing conventions for texts from the period, language, and culture. They should know how the transcription might be used in other contexts, especially translation studies or a study of quotations.
Editors need not understand everything about their texts, and they need not have any specialized skill in grammar or lexicography. They need not know the morphology of individual words, or how individual parts of the text have been translated. Those skills are more profitably spent editing other TAN formats.
TAN-T(EI) editors stand at the foundation level of the Text Alignment Network. Because other files will depend upon them, careful proofreading is important. Eliminating as many typographical errors as possible before publication will maximize the utility of a TAN-T(EI) file. On the other hand, TAN has been designed with the assumption that most files in circulation have typographical errors that can and should be corrected as they are found. If you are aware that a text needs proofreading, but you still want to make it available, simply leave a <comment> in the <to-do> part of the <head>.
If you are creating a TAN-T(EI) file, you are doing so primarily to facilitate alignment and annotation, which requires use of a suitable reference system (see reference systems). Transcription files should be segmented and labeled according to a reference system that is familiar and can be easily applied to other versions of the same text in other languages. If possible, semantic mileposts (clauses, sentences, paragraphs, chapters) should be prioritized over visual ones (lines, columns, pages, volumes). Any transcription can be furnished with multiple reference systems, but it is advisable to do so on the basis of separate files, linked by <redivision>s in the <head>. See the section called “One Reference System”.
Contributors and users of TAN files must sharply distinguish between a scriptum (text-bearing object) and a conceptual work, e.g., between a specific printed copy of the Iliad and the Iliad conceived generally. The former has materiality (digital files are treated here as being material) and the latter does not. Even though both are constitutively necessary for any transcription, the two are always differentiated in the TAN format: <source> and @src point to physical exemplars; <work>, @work, and <version> to the conceptual. Adherence to this distinction is quite important.
Some readers may be reminded at this point of the domain model defined by the Functional Requirements for Bibliographic Records (FRBR), which identifies in its Group 1 (Products of intellectual & artistic endeavor) four types of entities: work, expression, manifestation, and item. A work is "a distinct intellectual or artistic creation" and an expression is the conceptual, immaterial realization of a work. Both work and expression are terms for conceptual, non-material entities. A manifestation, on the other hand, is "the physical embodiment of an expression" and an item is a single exemplar of a manifestation.
Note | |
---|---|
Quotations in this section come from International Federation of Library Associations and Institutions, Functional Requirements for Bibliographic Records: Final Report, amended and corrected (February 2009), http://www.ifla.org/VII/s13/frbr/. |
Table 5.1. Examples of FRBR Group 1 Entities
Work | Expression | Manifestation | Item |
---|---|---|---|
Iliad | Caroline Alexander's English translation of the Iliad. | the print run identified with ISBN 978-0062046284 | A specific copy |
The Psalms | The (Hebrew) Masoretic Psalter | The 1820 printing of George Offor's edition of the Hebrew Psalms | Biblioteca Palatina Cod. Parm. 1699 |
A River Runs Through It | Norman Maclean's original version; the 1992 film version | Print run ISBN 0226500608; Blu-ray disc UPC code 004339632533 | Author's personal print copy; reference print CGB 7432-7438 (deposited in the Library of Congress) |
TAN's domain model differs slightly. The most important difference is abandonment of FRBR's expressions, which was considered problematic in the development of sample TAN data. The term expressions was intended to describe a conceptual, non-material entity, but the FRBR guidelines defined and explained it in vague or material terms.
Note | |
---|---|
"Expression encompasses, for example, the specific words, sentences, paragraphs, etc. that result from the realization of a work in the form of a text....defined, however, so as to exclude aspects of physical form, such as typeface and page layout, that are not integral to the intellectual or artistic realization of the work as such." (ibid., p. 19, emphasis added) That is, expression includes integral aspects of physical form (e.g., typeface that is integral to the realization). "Inasmuch as the form of expression is an inherent characteristic of the expression, any change in form (e.g., from alpha-numeric notation to spoken word) results in a new expression." (p. 20, emphasis added) |
Even the very term expression and FRBR's preferred synonym, realization, imply materiality (without which nothing can be expressed or realized). Further, FRBR's expression does not easily handle creative adaptations of works that are themselves arguably works in their own right. For example, Euripides' Medea was adapted several centuries later by Seneca the Younger. Seneca's Medea is arguably merely an expression, but has itself been subject to various editions and performances, i.e., expressions. But FRBR does not accommodate expressions of expressions. If Seneca's Medea is treated as a work in its own right, its expression relationship to Euripides' original is lost, since FRBR does not accommodate works that are expressions of other works.
In the TAN domain model, expression is altogether dropped. There is only one type of conceptual, non-material entity, namely, a work.
The term version in TAN is applied to a work that substantially follows but varies from another work, e.g., translations and adaptations. But such versions are themselves still works. One work is indicated to be a version of another in a class-1 file through the <work> and <version> declarations.
As for material entities, FRBR's manifestation and item are combined in TAN through the term scriptum. A scriptum is a text-bearing object, e.g., book, manuscript, pamphlet, tombstone, traffic sign, digital file (digital media is interpreted as being material). When scriptum is used in a TAN file, it points either to a single physical item or to a set of physical items that are for all intents and purposes indistinguishable (i.e., a scriptum reproduced mechanically). A scriptum that points to a manuscript points only to that one particular manuscript. But a scriptum that points to a printed book or a digital file is understood as applying to all copies of that printed book or digital file.
There is at present no formal mechanism to specify whether a scriptum points to one object or a set of objects. The distinction must be inferred from a scriptum's IRI + name pattern. In cases of potential ambiguity, it is up to creators of a TAN file to assign to the scriptum IRIs that avoid confusion. For example, to point to Edward Gibbon's personally annotated copy of the 1763 edition of Herodotus (now held by the Wren Library, Trinity College, Cambridge University), one should not use https://lccn.loc.gov/92189906 or http://www.worldcat.org/oclc/27188122, which point to the set of all copies. In this case, one may need to mint their own IRI, based on the Wren Library's acquisition number, RW.50.15.
In summary, the TAN domain model defines two kinds of entities: works and scripta. Works, which are immaterial, conceptual entities, may contain other works, or they may be versions of other works (or work-versions). Scripta, which are material entities, may contain other scripta, and they may refer either to a single object or to a set of copies. A work may be instantiated in many scripta, and similarly, any scriptum may contain many works. Most work-scriptum relationships can be inferred from the <head> of a class-1 file, and they may be expressed in a <TAN-A> file.
Table 5.2. Examples of TAN Entities
Work | Scriptum |
---|---|
Iliad; Caroline Alexander's English translation of the Iliad | The print run identified with ISBN 978-0062046284; a specific copy |
The Psalms; the (Hebrew) Masoretic Psalter | The 1820 printing of George Offor's edition of the Hebrew Psalms; Biblioteca Palatina Cod. Parm. 1699 |
Norman Maclean's A River Runs Through It; the 1992 film A River Runs Through It | Print run ISBN 0226500608; author's personal print copy; Blu-ray disc UPC code 004339632533; reference print CGB 7432-7438 (deposited in the Library of Congress) |
Every TAN-T(EI) file must be restricted to a transcription of a single version of a single work found on a single scriptum, segmented and labeled according to a single reference system.
The principle above is critical to the success of the network. It reduces the risk of confusion and simplifies the files. It follows the generally advisable principle that master data should be disaggregated.
Each TAN-T(EI) file must transcribe one and only one text-bearing object or scriptum. It may be a digital file, a book, a manuscript, a stone, a sign, or a bottlecap. If the object you've chosen has been made mechanically and is virtually indistinguishable from other objects created by the same process (e.g., copies of a printed book or copies of a digital file), then the entire set of copies (what some librarians call a manifestation) is to be regarded as the scriptum.
Identifying and naming a scriptum might require an editor's discernment and judgment. For example, some manuscripts have been split up, their parts now residing in multiple libraries around the world; other manuscripts are composites, made of several manuscripts. In such cases, you may need to define your scriptum in a way that might not match the way others define it. But the decision is your prerogative, not theirs. You have both the right and responsibility to define your object in the way that you think will most benefit users of your files.
The scriptum is declared via <source>, which either takes the IRI + name pattern, or points to a <scriptum> vocabulary item. It is a good idea to name your scriptum with an <IRI> value in the form of an http URL that points to a detailed entry in a library catalogue. Doing so allows users to retrieve extensive, structured bibliographical information. You also save yourself the hassle of having to write a detailed, structured bibliographical description. If a URL cannot be found for <IRI>, you may simply coin a tag URN or a UUID. Alternatively, if you find another TAN file that uses the same scriptum-source, incorporate its <name>s and <IRI>s with your own (multiple <name>s and <IRI>s are a virtue).
If you need to specify exactly where on a scriptum a work-version appears (e.g., page range), <comment> or <desc> should be used.
The transcription must be restricted to a single creative work, identified by <work> (part of the declarations section of <head>).
Many scripta have more than one work. Identifying the creative work you transcribe is, once again, your prerogative. Suppose the scriptum you have is a Bible. You define the work. Perhaps you wish to encode the entire Bible and treat it as a single work. Or maybe you wish to treat only the New Testament as the work, or the Tetraevangelion, or the Gospel of Matthew, or a specific episode in that gospel, or merely the Beatitudes. Use whichever work you like, but make sure that the TAN-T(EI) file contains nothing but the work you have declared. It should be a complete representation of what is found on the object, even if only partially preserved, and respect as far as is practical the order of the text in the scriptum.
The requirement to provide the entirety of the work-version on the scriptum is a significant departure from the fourth principle of the section called “Assumptions in the Creation of TAN Data”. Users should be able to assume that the transcription in a class-1 file covers the entirety of the work-version chosen, within the particular scriptum. If you are aware that the transcription is incomplete, leave a <comment> to that effect in the <head>'s <to-do>, identifying which portions are missing from the transcription.
Well-known works may have a suitable IRI already assigned to them, say by means of a DBPedia entry. Most works have not been assigned IRIs or are named in IRI vocabularies that are not well known. You may assign any work your own URN, through a UUID or a tag URN.
The transcription must be restricted to a single version of the creative work, identified perhaps by <version> (part of the declarations section of <head>). In most cases, <version> is unnecessary, because <work> in conjunction with <source> is sufficient to identify a particular work-version. But if the source carries multiple versions (e.g., a bilingual edition of a text), then <version> should be included, to specify which version has been transcribed. <version> can also be used to declare explicitly that the work mentioned in <version> is a version of the work mentioned in <work>.
If you have a scriptum with multiple versions of a work, and you wish to transcribe them all, each version should be in its own separate TAN-T(EI) file.
There may be cases where individual textual divisions are repeated, not so much because they represent a different version, but because they are variants that are integral to the work-version chosen. Creating a separate file for such individual cases would be both impractical and misleading. Standard TAN vocabulary for div types includes as a standard item variant, which may be used to wrap every variant in its own <div>, e.g.,

    . . . . .
    <div type="title" n="title">
       <div type="variant" n="orig">The Place</div>
       <div type="variant" n="subscript" xml:lang="grc">Ὁ Τόπος</div>
    </div>
    . . . . .
Notes should be included only if they are an integral part of the primary work (i.e., by the same author, not by a later editor). If you think the notes to a work are important, and legitimately a work in their own right, consider putting them in their own TAN-T(EI) file, or converting them to claims in a TAN-A file.
Very few work-versions have IRIs. It is advisable to assign a tag URN or a UUID. If the IRI you have used for <work> is in a namespace that you own or control, then you are entitled to modify it, and you may wish merely to add a suffix to the work IRI. For example, you might have tag:urn:example.com,2001:work:a defined for the work; a 1987 German translation might be specified as tag:urn:example.com,2001:work:a:ver:1987:deu.
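The two naming strategies just described (coining a UUID, or suffixing a work IRI in a namespace you control) can be sketched in Python. The function names are hypothetical and are not part of TAN; the `ver` suffix pattern simply mirrors the example above and is a convention, not a requirement.

```python
import uuid


def mint_uuid_urn() -> str:
    """Coin a fresh, random UUID-based URN, e.g. urn:uuid:...."""
    return uuid.uuid4().urn


def version_iri(work_iri: str, *parts: str) -> str:
    """Derive a work-version IRI by suffixing a work IRI that you control.

    Assumption: the colon-separated "ver" suffix follows the example in the
    text; any stable, unique suffix would serve equally well.
    """
    return ":".join([work_iri, "ver", *parts])


if __name__ == "__main__":
    print(mint_uuid_urn())
    print(version_iri("tag:urn:example.com,2001:work:a", "1987", "deu"))
```

Suffixing is appropriate only when the work IRI sits in a namespace you own; otherwise a freshly minted UUID avoids making claims about someone else's namespace.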
Every TAN transcription must be segmented into a hierarchy of labeled divisions, defined in the <body> through <div>s and their @n values.
Those divisions, whenever possible, should align with the reference system that prevails for the work across different versions or translations, in what is sometimes called a canonical reference system. Because even the most familiar reference system admits degrees and dispute, the term canonical is problematic. It is avoided in these guidelines, which refer simply to a work's reference system.
If you have your choice, preference should be given to reference systems that follow the semantic contours of the work, not the physical features of a particular scriptum. Chapter, paragraph, and sentence numbers are preferable to volume, page, and line numbers, because other versions of the work (e.g., translations, paraphrases) will only roughly, if at all, follow a reference system based on features found in a particular scriptum.
Sometimes a scriptum-based reference system is inescapable, or is the most common reference system for a work (e.g., Porphyry's commentary on the Categories). It is perfectly acceptable to adopt that system, but it may entail more labor during the alignment process.
If a given work has more than one common reference system (e.g., the works of Plato and Aristotle, which have two reference systems, logical and scriptum-oriented, both of which are standard and important), then the recommended practice is to create two class-1 files with identical transcriptions, each one structured by its own reference system. Place in each file a <redivision> pointing to the other. Under verbose validation, you will be notified if there are textual discrepancies between the transcriptions, and Schematron Quick Fixes will allow you to automatically update one text to match the other.
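The idea behind the discrepancy check can be sketched in Python: two differently divided files should yield the same running text once their leaf divisions are concatenated. This is only an illustration under simplifying assumptions, not TAN's validation routine. The function names are hypothetical, leaf divisions are passed in as plain strings, and the special handling of line-end characters (soft hyphen, zero-width joiner) is ignored.

```python
import re
from typing import List, Optional


def flatten(leaf_texts: List[str]) -> str:
    """Concatenate leaf-division texts and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", " ".join(leaf_texts)).strip()


def first_discrepancy(file_a: List[str], file_b: List[str]) -> Optional[int]:
    """Return the character offset where the two texts diverge, or None."""
    text_a, text_b = flatten(file_a), flatten(file_b)
    if text_a == text_b:
        return None
    for offset, (char_a, char_b) in enumerate(zip(text_a, text_b)):
        if char_a != char_b:
            return offset
    # One text is a prefix of the other: they diverge where the shorter ends.
    return min(len(text_a), len(text_b))


if __name__ == "__main__":
    logical = ["In the beginning", "was the word."]
    by_page = ["In the", "beginning was", "the word."]
    print(first_discrepancy(logical, by_page))
```

A result of `None` means the two divisions carry the same text; an integer points to the first place an editor should look.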
Having two or more alternatively divided editions can be quite useful. They could serve as the basis for reference cross-indexes, or to help convert other versions of the work from one reference system to the other.
If there is a good reference system, but the divisions are overly lengthy, you may introduce subdivisions. But there is no guarantee that the provisional subdivisions you introduce will be adopted by other editors who create or edit TAN versions of the same work. Editors who work independently on the same text and subdivide it will likely produce discordant schemes. Class-2 formats provide a mechanism via <adjustments> to reconcile some basic differences. But a discordant scheme might be best handled simply by creating a copy and restructuring it according to the preferred system, making sure related files refer to each other through <redivision>.
If a work does not have a reference system, or if you think that the ones that exist are inadequate or misguided, create one of your own. If you develop your own reference system, be sure to design it so that it can be easily applied to any version of the work, including translations. Prefer logical divisions of text over scriptum-based divisions.
TAN supports five major methods of numeration in reference systems:
Arabic numerals. 1, 2, 3, etc.
Roman numerals. Values up to 5000, utilizing i, v, x, l, c, d, and m, uppercase or lowercase, with liberal syntactic rules (within a roman numeral, any digit preceding one of a higher value will be deducted from the total value; all others are added).
Alphabetic sequences. The 26-letter Latin alphabet, with numbers higher than 26 (or any multiple of 26) beginning with the letter a incrementally repeated, e.g., y (25), z (26), aa (27), bb (28), … aaa (53). Uppercase or lowercase allowed. (Note, this is not the hexavigesimal (base 26) system, where a is 0, b is 1, z is 25, aa is 00, ab is 01, etc.)
Arabic numerals + alphabetic sequences. Arabic numerals followed immediately by an alphabetic sequence. The second item is to be calculated as a subsequence of the first item, with the lack of a second item taking highest priority. E.g., 4, 4a, 4b, 4c....
Alphabetic sequences + Arabic numerals. As above, but with alphabetic sequence preceding Arabic numerals.
See tan:letter-to-number() and references there to TAN functions for converting numbering systems.
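The letter-based schemes above can be sketched in Python. This is an illustrative approximation, not TAN's own implementation, and the function names are hypothetical. It assumes that a Roman digit is deducted only when the digit immediately after it has a higher value, which is one possible reading of the liberal rule described above. It does not enforce the 5000 ceiling.

```python
import re
from typing import Tuple

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral, uppercase or lowercase, with liberal syntax.

    Assumption: a digit is deducted when the digit immediately after it has
    a higher value; every other digit is added.
    """
    digits = [ROMAN_VALUES[ch] for ch in numeral.lower()]
    total = 0
    for pos, value in enumerate(digits):
        following = digits[pos + 1] if pos + 1 < len(digits) else 0
        total += -value if value < following else value
    return total


def alphabetic_to_int(sequence: str) -> int:
    """Convert a repeated-letter sequence: a=1 ... z=26, aa=27, bb=28, aaa=53."""
    seq = sequence.lower()
    if not re.fullmatch(r"([a-z])\1*", seq):
        raise ValueError(f"not a repeated-letter sequence: {sequence!r}")
    position = ord(seq[0]) - ord("a") + 1
    return (len(seq) - 1) * 26 + position


def arabic_alpha_key(label: str) -> Tuple[int, int]:
    """Sort key for labels such as 4, 4a, 4b; a bare number sorts first."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", label)
    if match is None:
        raise ValueError(f"not an Arabic + alphabetic label: {label!r}")
    number, letters = match.groups()
    return int(number), (alphabetic_to_int(letters) if letters else 0)


if __name__ == "__main__":
    print(roman_to_int("xiv"), alphabetic_to_int("aaa"))
    print(sorted(["4b", "4", "4a"], key=arabic_alpha_key))
```

Returning a tuple for the combined scheme keeps the subsequence separate from the main number, so that 4 sorts before 4a, and 4c before 5.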
The TAN validation process attempts to convert all values of @n to Arabic numerals. Some values are ambiguously Roman numerals or alphabetic sequences. For example, c could mean 3 (alphabetic sequence) or 100 (Roman numeral). Such numerals are assumed to be Roman, unless you supply a <numerals> and assign @priority to specify letters (or roman).
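The ambiguity rule can be illustrated with a short Python sketch. It is not TAN's validation code. The function `n_to_int` and its `priority` parameter are hypothetical stand-ins for the effect of @priority, and the Roman conversion assumes that a digit is deducted only when the next digit is higher.

```python
import re
from typing import Optional

ROMAN_DIGITS = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def _roman(value: str) -> int:
    digits = [ROMAN_DIGITS[ch] for ch in value]
    return sum(
        -d if i + 1 < len(digits) and d < digits[i + 1] else d
        for i, d in enumerate(digits)
    )


def _alphabetic(value: str) -> int:
    # a=1 ... z=26, aa=27, bb=28, and so on.
    return (len(value) - 1) * 26 + ord(value[0]) - ord("a") + 1


def n_to_int(n: str, priority: str = "roman") -> Optional[int]:
    """Return the integer for an @n value, or None if it is not numeric.

    `priority` decides values that are both a valid Roman numeral and a
    valid alphabetic sequence; "roman" is the default, as in the text.
    """
    value = n.strip().lower()
    if re.fullmatch(r"[0-9]+", value):
        return int(value)
    is_roman = re.fullmatch(r"[ivxlcdm]+", value) is not None
    is_alpha = re.fullmatch(r"([a-z])\1*", value) is not None
    if is_roman and is_alpha:
        return _roman(value) if priority == "roman" else _alphabetic(value)
    if is_roman:
        return _roman(value)
    if is_alpha:
        return _alphabetic(value)
    return None


if __name__ == "__main__":
    print(n_to_int("c"), n_to_int("c", priority="letters"))
```

Only single letters and repeated letters drawn from i, v, x, l, c, d, and m are ambiguous; everything else resolves the same way regardless of priority.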
@n vocabulary
If you are using @n to label the names of books of the Bible or Surahs of the Qur'an, you will run into the issue of different conventions for @n. To avoid this long-standing problem, you may want to use extra TAN vocabulary for @n. If you include in the head of your TAN file <vocabulary which="bible eng"/>, then any non-numeric values of @n will be checked against the corresponding TAN-voc file (in this case, the TAN-voc file at /vocabularies/extra/n.bible.eng.tan-voc.xml). This, in turn, will allow other files to refer to that <div> by any other <name> that is a synonym.
For example, in a class-1 file pointing to the TAN English Bible vocabulary above, a <div type="book" n="matt">...</div> would be regarded as containing the work the Gospel of Matthew. Any class-2 file that refers to that class-1 file as a source may use any synonym listed in the extra vocabulary file n.bible.eng.tan-voc.xml, i.e., Mt, Mat, Matt, or Matthew (or their lowercase equivalents). An extra benefit of this method is that such <div>s are also marked as the works, identified by the <IRI>s of the target TAN vocabulary items.
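The effect of this aliasing can be pictured as a case-insensitive synonym lookup. The Python sketch below is a simplification, not how TAN resolves vocabulary. The `SYNONYMS` table is a hypothetical excerpt containing only the names mentioned above, and `same_div` is an invented helper name.

```python
from typing import Dict, List

# Hypothetical excerpt: only the synonyms named in the text are listed.
SYNONYMS: Dict[str, List[str]] = {
    "matthew": ["mt", "mat", "matt", "matthew"],
}

# Invert the table so that every alias points to its vocabulary item.
ALIAS_INDEX = {
    alias: item for item, aliases in SYNONYMS.items() for alias in aliases
}


def same_div(n_in_source: str, n_in_reference: str) -> bool:
    """True if two @n values name the same vocabulary item, ignoring case."""
    source_item = ALIAS_INDEX.get(n_in_source.lower())
    reference_item = ALIAS_INDEX.get(n_in_reference.lower())
    return source_item is not None and source_item == reference_item


if __name__ == "__main__":
    print(same_div("matt", "Mt"))
```

Values absent from the table never match, which mirrors the need for a vocabulary entry before a synonym can be used.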
If you use extra TAN vocabulary, it is recommended you include in the declarations section of your <head> an <n-alias>. This element, along with its @div-type, specifies exactly which types of <div>s are eligible for this kind of aliasing on @n. Supplying this element considerably speeds the validation process on long files.
The goal behind the extra vocabularies is to eliminate the need to worry about what abbreviations are used to name well-known, unnumbered <div>s. It is hoped that in future releases of TAN these extra vocabularies will grow in number and quality.
Extra TAN @n vocabularies:
You should declare how you have normalized the transcription via <adjustments> and its children, e.g., <normalization> or <replace>. (For suggestions on values of <IRI> for <normalization> see the section called “TAN keywords for types of normalizations (<normalization>)”.)
Generally speaking, normalization entails the suppression of things extraneous to or separable from the work-version you have chosen. You are encouraged to omit parenthetical editorial insertions (especially quotation references), stray handwritten remarks, discretionary word-breaking hyphens, editorial comments, inserted cross-references, and reference numerals (page numbers, section numbers, etc.). If chapter 4 of a text begins "4." or "IV" then leave out that labeling numeral: you've already indicated it in @n, so there's no need to clutter the transcription with it.
Remember, scholars who use your file will be concerned with things like word-for-word alignments and lexico-morphological analysis, and putting in a modern editor's "4" might contaminate research results. For the same reason, you should resolve ligatures and correct unintended typographical errors.
The goal is a transcription whose text is free of the interpretive voice of later editors. You should remove from the text anything that is not part of the work proper and would interfere with detailed word-for-word alignment, or would require extra preprocessing or postprocessing work for other users. If you are breaking a transcription into individual lines, and you are required to break a word, do so with either the soft hyphen (U+00AD), the zero-width space (U+200B), or the zero-width joiner (U+200D). TAN processors that handle the text within a leaf <div> will automatically normalize its space. If either of those two characters is found at the end, then it will be deleted and the text from the next leaf <div> (if there is one) will immediately follow without intervening space; if those two characters do not occur at the end, then a space (U+0020) will be added, and all other space will be normalized. For more details, see the section called “Space characters and normalization”.
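The joining behavior can be sketched in Python. This is an approximation for illustration, not a TAN processor. The passage names three characters but refers to "those two", so the sketch assumes that the two special end-of-division characters are the soft hyphen and the zero-width joiner. The function name is hypothetical.

```python
import re
from typing import List

SOFT_HYPHEN = "\u00ad"
ZERO_WIDTH_JOINER = "\u200d"
# Assumption: these are the two characters that suppress the joining space.
SPECIAL_END_CHARS = (SOFT_HYPHEN, ZERO_WIDTH_JOINER)


def join_leaf_divs(leaf_texts: List[str]) -> str:
    """Join the text of successive leaf divisions as described above."""
    pieces = []
    for text in leaf_texts:
        # Collapse internal whitespace and trim the edges.
        text = re.sub(r"\s+", " ", text).strip()
        if not text:
            continue
        if text.endswith(SPECIAL_END_CHARS):
            # Drop the special character; the next division follows directly.
            pieces.append(text[:-1])
        else:
            # Otherwise a single space separates this division from the next.
            pieces.append(text + " ")
    return "".join(pieces)


if __name__ == "__main__":
    lines = ["The editor was walk\u00ad", "ing   home", " that night "]
    print(repr(join_leaf_divs(lines)))
```

A word broken across two line-based divisions is thereby restored as a single token, which is what later word-for-word alignment depends on.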
In a digital source, variable lengths of special spacing marks (e.g., General Punctuation U+2000..U+200B) should be converted to ordinary spaces (see the section called “Unicode points not allowed”), and superscript combining Roman letters (U+0363..U+036F) should probably be converted to their non-combining counterparts. All Unicode must be normalized to NFC forms (see the section called “Unicode Normalization”).
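These three conversions can be sketched with Python's standard library. The function name is hypothetical, and the choice to convert the superscript combining letters unconditionally is an assumption; the text above only says they should probably be converted.

```python
import re
import unicodedata

# Combining Latin small letters U+0363..U+036F, in code-point order.
COMBINING_LETTERS = dict(zip(range(0x0363, 0x0370), "aeioucdhmrtvx"))


def normalize_digital_source(text: str) -> str:
    """Apply the space, combining-letter, and NFC conversions described above."""
    # Special spacing marks (U+2000..U+200B) become ordinary spaces.
    text = re.sub("[\u2000-\u200b]", " ", text)
    # Superscript combining letters become their non-combining counterparts.
    text = text.translate(COMBINING_LETTERS)
    # Finally, normalize everything to NFC.
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    sample = "u\u0364ber\u2003alles, cafe\u0301"
    print(normalize_digital_source(sample))
```

Running NFC last means that any sequences left decomposed by the earlier steps are still composed in the final text.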
Variant readings should not be transcribed. For example, a manuscript may have correctors' marks. Or a set of footnotes (or apparatus criticus) might provide an alternative reading. In those cases, each set of corrections should be moved to a separate TAN-T file, or rewritten as <claim>s of a TAN-A file.
In some ambiguous areas, you can use TAN-TEI both to normalize and to preserve what is in the scriptum. Suppose, for example, a manuscript has reference numerals that are sui generis. That is, these reference numbers do not correspond to the "canonical" reference scheme, and are scribal adjustments to the text's structure (sometimes mistaken). On the one hand, such reference numerals are metadata, and should arguably be deleted; on the other, they are part of the text, and witness to how a text was read and changed over time. A middle-ground approach would move these references to TAN-TEI's <milestone rend="[TEXT]">, substituting [TEXT] for the reference text. In that way, the numerals are properly removed from the main text, but the information is retained. Generally speaking, TEI's @rend is an excellent way to remove something from a transcription while keeping it in the file.
Overall, normalization is a difficult, understudied topic. Scholars are not in the habit of documenting everything they normalize, and sometimes have so internalized a set of normalizations that they are unaware of them. Not all decisions will be clear-cut. You may justly hesitate before normalizing orthography, punctuation, accentuation, or capitalization. Some aspects of Unicode that permit different conventions may need special consideration. You may need to deliberate on whether an unusual or rarely used Unicode character might be misinterpreted or hinder searches. Document any decisions in the <adjustments>. Whether you use <normalization> or <replace> is up to you. The former can be used to apply a class of changes to a vocabulary item. The latter provides a precise, regular-expression-based method of describing exactly what has been changed, and the order in which those changes took place. Note, a <replace> might help one to reconstruct the path that led from the input to the output, but not the reverse. If it is important to document exactly what the pre-normalized version of a text was like, use <predecessor> or a similar element available in the key links section of the <head> (see the section called “Other Related Files”) to point to the original.
If you find it very difficult to bring yourself to normalize to the depth advised above, try first making a (non-TAN) TEI file, and create the transcription you have in mind as the ideal. Once that is finished, create a second, TAN version, and be more aggressive in your normalization, with <see-also> pointing to the first approach. Users of your TAN transcription will be more interested in your TAN version than the TEI version, but you will have at least satisfied your craving to avoid normalizing.
The footnotes or endnotes in a scriptum should be normalized. Many, most, or all should likely be deleted. Before deciding, distinguish those that are an intrinsic part of the work you're transcribing from those that aren't. Those that aren't can be removed, or they can be put into a separate TAN-T(EI) file, perhaps linking the two through <see-also>, and hopefully structuring both files with the same reference system, to facilitate alignment. Another way to approach the task is to convert some or all of the notes you're removing into <TAN-A> <claim>s.
Footnotes, endnotes, glosses, or marginalia that are intrinsic parts of the work present special challenges for encoding in general, and normalization in particular.
First is the issue of connecting an annotation to the text annotated. When we encounter a superscript number—a note signal—while reading the text of a printed book, we infer that we are being invited to find a companion footnote, and that footnote comments on the text we have just read. But specifically what text? Is it only the preceding word? Is it a word or phrase that occurs earlier in the sentence? Does the annotation cover earlier sentences, the entire paragraph, or even prior paragraphs? For some notes, identifying the text being annotated requires interpretation.
In a digital file, connecting an annotation to its text cannot be so vague; it requires a decision and a commitment. Here are three possible ways to approach annotations in a TAN file:
1. Use the <note> feature of TAN-TEI (see related TEI documentation). This will allow you to connect the annotation to merely an anchor in the text, i.e., to no text whatsoever.

       <div n="1" type="p">
          <p>The process occurred in New York, among other places.<ref rend="1"/>
             <note><p><ref rend="1"/>On New York, see: X.</p></note>
          </p>
       </div>

2. Move each annotation into a <div> with a @type that implies that it is an annotation (e.g., scholium) and place it immediately after the <div> it annotates.

       <div n="1" type="p">The process occurred in New York, among other places.</div>
       <div n="n1" type="footnote">On New York, see: X.</div>

   Note in the example above that n1 is used to make sure that 1 unambiguously points to only one <div>.

3. As #2, but also write a <TAN-A> file that more precisely connects each annotation to the text it annotates.

       <claim verb="annotates">
          <subject src="text" ref="n1"/>
          <object src="text">
             <from-tok ref="1" val="The"/>
             <through-tok ref="1" val="York"/>
          </object>
       </claim>
The first option is expeditious, and will allow you to be as precise or imprecise as you like. Validation is not affected, but you should be aware that the <note> will be treated as a constituent part of its parent <div>. The second option is also relatively easy, but it entails a decrease in precision. The third option provides immense precision, permits multiple annotations on the same text range, and allows notes to target overlapping ranges of text. But the task could be time-consuming, if only because you will need to determine the range of text targeted by each annotation, and the targeted text might be quite messy or vague. You will need to take stock of how precise and comprehensive you choose to make your connections. (See also accuracy, precision, and comprehensiveness.)
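To see what a token range such as the one in the third option selects, consider this Python sketch. It is not TAN's tokenizer. It assumes that tokens are runs of word characters and that each value refers to its first occurrence, and the function names are hypothetical.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Split text into word tokens (assumption: runs of word characters)."""
    return re.findall(r"\w+", text)


def token_range(text: str, first: str, last: str) -> List[str]:
    """Return the tokens from the first `first` through the next `last`."""
    tokens = tokenize(text)
    start = tokens.index(first)
    end = tokens.index(last, start)
    return tokens[start : end + 1]


if __name__ == "__main__":
    div_1 = "The process occurred in New York, among other places."
    print(token_range(div_1, "The", "York"))
```

Here the annotation would cover the six tokens from "The" through "York" and leave "among other places" outside its scope.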
Remember that the note signals in the main text and in the footnote area are metadata meant to help readers link corresponding passages of texts, and in the spirit of normalizing should be deleted. In a TAN-TEI file you can replace a note signal with <ref> (see above).