The first TAN-T transcription had a longer <head> than the second one did. That is
because in the former we used the explicit method of specifying every IRI and name,
whereas in the latter we adopted shortcuts that took advantage of TAN vocabulary. TAN vocabularies
are meant not merely to be a convenience; they are intended to avoid problems that
beset projects that create many files with repeated data patterns. When (not if) you
make changes to one file you have to remember all the other places where you might
need to make the same changes. The old programmer's adage "Don't repeat yourself"
(DRY) is operative here. If there is a repeating data pattern, put it in one master
place, and let the other files point to that pattern. When we make changes, we do so
only at a single place.
The previous examples drew from standard TAN vocabulary, which is written in one
of the other TAN formats, TAN-voc. There is a whole collection of standard TAN-voc
files in the project subdirectory called vocabularies. We can write our
own TAN-voc files, to collect the vocabulary items that we will use repeatedly from
one file to the next. For example:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="../../schemas/TAN-voc.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="../../schemas/TAN-voc.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2020"
   id="tag:parkj@textalign.net,2015:TAN-voc:standard">
   <head>
      <name>Keywords for TAN files edited by Jenny Park</name>
      <license licensor="park" which="by 4.0"/>
      <vocabulary-key>
         <person which="Jenny Park" xml:id="park"/>
      </vocabulary-key>
      <file-resp who="park"/>
      <resp roles="creator" who="park"/>
      <change when="2019-10-08" who="park">Started file</change>
      <to-do>
         <comment when="2020-01-04" who="park">Need to check files for new vocabulary items.</comment>
      </to-do>
   </head>
   <body>
      <group affects-element="person">
         <item>
            <IRI>tag:parkj@textalign.net,2015:self</IRI>
            <name xml:lang="eng">Jenny Park</name>
         </item>
      </group>
      <item affects-element="work">
         <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
         <name>Ring a Ring o' Roses</name>
         <name>Ring Around the Rosie</name>
      </item>
   </body>
</TAN-voc>
In this example, updates have been made to @id and <name>, and a <comment> has been
added to <to-do>. The most significant difference is the <body>, which has two <item>s,
one of which is wrapped in a <group>. Each @affects-element specifies one or more names
of elements that the enclosed items affect, and the <item>s have the standard IRI + name
pattern. <group>s may nest as you like.
The difference between a grouped and ungrouped <item> is purely a matter of taste and
convenience. The example above illustrates both methods.
The <vocabulary-key> has a <person> whose @which points to the first <item> in the
<body>, by matching its <name>. That is, a TAN-voc file can use its own vocabulary,
without repeating the IRI + name pattern in <vocabulary-key>.
Let's return to the <head>s of our two TAN-T files, and see how to incorporate our new
TAN-voc vocabulary file.
<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020"
   id="tag:parkj@textalign.net,2015:ring01">
   <head>
      <name>TAN transcription of Ring a Ring o' Roses</name>
      <master-location href="http://textalign.net/release/TAN-2020/examples/ring-o-roses.eng.1881.xml"/>
      <license which="by 4.0" licensor="park"/>
      <work which="Ring around the Rosie"/>
      <source>
         <IRI>http://lccn.loc.gov/12032709</IRI>
         <name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]</name>
      </source>
      <vocabulary>
         <IRI>tag:parkj@textalign.net,2015:TAN-voc:standard</IRI>
         <name>Vocabulary for TAN files edited by Jenny Park</name>
         <location href="TAN-voc/park-projects.TAN-voc.xml" accessed-when="2020-01-10"/>
      </vocabulary>
      <vocabulary-key>
         <person xml:id="park" which="Jenny Park"/>
         <div-type xml:id="line" which="line (verse)"/>
      </vocabulary-key>
      <file-resp who="park"/>
      <resp roles="creator" who="park"/>
      <change when="2014-08-13" who="park">Started file</change>
      <to-do/>
   </head>
   . . . . . . .
</TAN-T>
<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020"
   id="tag:parkj@textalign.net,2015:ring02">
   <head>
      <name>TAN transcription of Ring around the Rosie</name>
      <master-location>ring-o-roses.eng.1987.xml</master-location>
      <license which="by 4.0" licensor="park"/>
      <work which="Ring around the Rosie"/>
      <source>
         <IRI>http://lccn.loc.gov/87042504</IRI>
         <name>Mother Goose, from nursery to literature / by Gloria T. Delama, 1987.</name>
      </source>
      <vocabulary>
         <IRI>tag:parkj@textalign.net,2015:TAN-voc:standard</IRI>
         <name>Vocabulary for TAN files edited by Jenny Park</name>
         <location href="TAN-voc/park-projects.TAN-voc.xml" accessed-when="2020-01-10"/>
      </vocabulary>
      <adjustments>
         <normalization which="no hyphens"/>
      </adjustments>
      <vocabulary-key>
         <div-type xml:id="l" which="line (verse)"/>
         <person xml:id="park" which="Jenny Park"/>
      </vocabulary-key>
      <resp roles="creator" who="park"/>
      <change when="2014-10-24" who="park">Started file</change>
      <comment when="2014-10-24" who="park">See p. 39 of source.</comment>
      <to-do/>
   </head>
   . . . . . .
</TAN-T>
In each TAN-T file, a new <vocabulary> points to the project TAN-voc vocabulary file
we have just created. Along with the customary IRI + name pattern is a new element,
<location>, which specifies where the digital file was accessed and when (through
@accessed-when). We may include as many of these <location> elements as we wish, with
the most preferred or reliable one at the top. The validation process will consult only
the first one that leads to an available document. The @accessed-when value is
important, because the validator will look for changes in the file since we last
accessed it, and if any changes are found a warning with a summary of the changes will
be returned. It is then up to us to determine if the alterations merit any action on
our part.
Similarly, anyone using or depending upon our file will be notified of any changes we make, through the same validation process.
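The rule that only the first available <location> is consulted can be sketched in Python. This is an illustration under stated assumptions, not how a TAN validator is implemented: the helper name first_available_location is hypothetical, every @href is treated as a local path relative to a base directory, and the comparison against @accessed-when is left out.

```python
from pathlib import Path
from typing import Iterable, Optional


def first_available_location(hrefs: Iterable[str], base_dir: Path) -> Optional[Path]:
    """Return the first href, in document order, that leads to an existing file.

    Hypothetical helper: hrefs are assumed to be local paths relative to
    base_dir. Remote URLs and @accessed-when checks are not handled here.
    """
    for href in hrefs:
        candidate = (base_dir / href).resolve()
        if candidate.is_file():
            # Later <location> elements are never consulted once one resolves.
            return candidate
    return None
```

Because the loop stops at the first hit, the order of the <location> elements doubles as an order of preference, which is why the most reliable copy belongs at the top.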
Once the <vocabulary> is in place, we can draw from our predefined vocabulary. Hence,
these revised versions of the <head>s are a bit more compact and easier to read. The
longer the TAN file, the more noticeable the improvement. And when our library grows
into dozens of files, we'll be grateful that a change that affects all the files needs
to be made only once.
Now that we have created the metadata for our transcriptions, let's turn to the
alignment files. Those <head>s will look slightly different, because they are not
concerned with transcriptions per se. We start with the TAN-A file:
<TAN-A xmlns="tag:textalign.net,2015:ns" TAN-version="2020"
   id="tag:parkj@textalign.net,2015:ring-alignment">
   <head>
      <name>div-based alignment of multiple versions of Ring o Roses</name>
      <master-location href="http://textalign.net/release/TAN-2020/examples/TAN-A/ringoroses.div.1.xml"/>
      <license which="by_4.0" licensor="park"/>
      <source xml:id="eng-uk">
         <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
         <name>Transcription of ring around the roses in English (UK)</name>
         <location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/>
      </source>
      <source xml:id="eng-us">
         <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
         <name>Transcription of ring around the roses in English (US)</name>
         <location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/>
      </source>
      <vocabulary-key>
         <person xml:id="park" which="Jenny Park"/>
      </vocabulary-key>
      <resp who="park" roles="creator"/>
      <change when="2014-08-14" who="park">Started file</change>
      <to-do>
         <comment when="2018-08-09-04:00" who="park">Finish file.</comment>
      </to-do>
   </head>
   . . . . . .
</TAN-A>
Much of the code above will look similar to the previous two examples. The file's
<name> and <master-location> are updated. Just as TAN-T files have <source>s, so do
TAN-A files, except that those sources are always TAN-T transcription files, and they
take the IRI + name + location pattern we saw above in <vocabulary>. Because alignment
files take only TAN transcription files as sources, each <source>'s <IRI> always takes
the @id value of the target TAN-T transcription file. <name> is arbitrary. It may
replicate exactly the title found in the transcription file, or it may be modified,
perhaps to harmonize better with the names of the other sources. Our TAN-A file could
have any number of <source>s, and not necessarily for the same work. The order in which
we put the <source>s does not necessarily mean anything.
This <head> explains why the <body> of our TAN-A file is allowed to be empty. We have
already specified which sources are to be aligned and where they are to be found. Any
user or processor of a TAN-A file may assume that every <div> in every source should be
automatically aligned upon the basis of shared values of @n.
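The default alignment on shared @n values can be sketched in Python. This is a simplified illustration, not the logic of any actual TAN processor: the function name align_divs_by_n is hypothetical, each source is assumed to have been reduced to a flat mapping from @n to text, and the sample lines are placeholder data rather than quotations from the example files.

```python
from typing import Dict, List, Optional


def align_divs_by_n(
    sources: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, Optional[str]]]:
    """Group div text from several sources under each @n value.

    Assumption: `sources` maps a source id to a flat {n: text} mapping.
    A source that lacks a given @n contributes None for that row.
    """
    # Collect @n values in first-seen order across all sources.
    n_values: List[str] = []
    for divs in sources.values():
        for n in divs:
            if n not in n_values:
                n_values.append(n)
    return {
        n: {src_id: divs.get(n) for src_id, divs in sources.items()}
        for n in n_values
    }


# Placeholder data only: keys stand for source @xml:id values and div @n values.
sources = {
    "eng-uk": {"1": "first line (UK)", "2": "second line (UK)"},
    "eng-us": {"1": "first line (US)", "2": "second line (US)", "3": "third line (US)"},
}
alignment = align_divs_by_n(sources)
```

Here alignment["3"] pairs the third US line with None for the UK source, which shows that nothing needs to be declared in the <body> for matching divs to line up.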
Meanwhile we turn to our fourth file, TAN-A-tok, whose <head> might look like this:
<TAN-A-tok xmlns="tag:textalign.net,2015:ns"
   id="tag:parkj@textalign.net,2015:TAN-A-tok,ring01+ring02">
   <head>
      <name>token-based alignment of two versions of Ring o Roses</name>
      <master-location href="http://textalign.net/release/TAN-2020/examples/TAN-A-tok/ringoroses.01+02.token.1.xml"/>
      <license which="by-nc-nd_4.0" rights-holder="park"/>
      <token-definition src="ring1881 ring1987" which="letters"/>
      <source xml:id="eng-uk">
         <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
         <name>Transcription of ring around the roses in English (UK)</name>
         <location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/>
      </source>
      <source xml:id="eng-us">
         <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
         <name>Transcription of ring around the roses in English (US)</name>
         <location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/>
      </source>
      <vocabulary-key>
         <bitext-relation xml:id="B-descends-from-A" which="a/x+/b"/>
         <token-definition src="ring1881 ring1987" which="letters"/>
         <person xml:id="park" which="Jenny Park"/>
      </vocabulary-key>
      <change when="2015-01-20" who="park">Started file</change>
   </head>
   . . . . . .
</TAN-A-tok>
The TAN-A-tok <head> looks similar to the previous examples, except that
<vocabulary-key> has some new content.
<bitext-relation> states through @which or an IRI + name pattern the stemmatic
relationship we think holds between the two sources. We have used @which and the value
a/x+/b, pointing to a standard TAN vocabulary item for bitext relations:
<TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2020"
   id="tag:textalign.net,2015:tan-voc:bitext-relation">
   . . . . . .
   <item>
      <IRI>tag:textalign.net,2015:bitext-relation:a/x+/b</IRI>
      <name>a/x+/b</name>
      <desc>direct descent, B descends from A, one or more mediaries</desc>
   </item>
   . . . . . .
</TAN-voc>
<token-definition> specifies how we have defined our word tokens. @src has more than
one value, specifying that the same tokenization rule should be applied to both
sources. @which points to this standard TAN vocabulary item:
<TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2020"
   id="tag:textalign.net,2015:tan-voc:tokenizations">
   . . . . . .
   <item>
      <token-definition pattern="[\w&#xad;&#x200b;&#x200d;]+"/>
      <name>letters</name>
      <name>letters only</name>
      <name>general word characters only</name>
      <name>general ignore punctuation</name>
      <name>gwo</name>
      <desc>General tokenization pattern for any language, words only. Non-letters
         such as punctuation are ignored.</desc>
   </item>
   . . . . . .
</TAN-voc>
Up until now, all vocabulary items have taken the IRI + name pattern. The one
above does not have an IRI, only a <token-definition> with a @pattern. The value of
@pattern, which may look like gibberish, is a regular expression. "Regular" here
does not mean ordinary; rather it derives from the Latin regula, rule. Regular
expressions are rule-based patterned text searches. This particular pattern says that a
token is defined as any contiguous string of word characters (\w), soft hyphens
(U+00AD), zero-width spaces (U+200B), or zero-width joiners (U+200D).
This is TAN's default tokenization pattern, and it will be assumed for any TAN-A-tok
file that lacks a <token-definition>. TAN adopts this default because in
ordinary conversation, when we refer to the nth word in a sentence, we most often
ignore punctuation marks. For more on token definitions see the section called “Defining Words and Tokens” and the section called “TAN keywords for types of token definitions (<token-definition>)”. See also the section called “Regular Expressions”.
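To see what the default pattern does in practice, here is a small Python sketch. It is only an approximation: Python's \w may differ in edge cases from the regular expression flavor a TAN processor uses, the names LETTERS_PATTERN and tokenize are hypothetical, and the sample sentence is invented for illustration.

```python
import re
from typing import List

# The "letters" pattern described above: word characters plus soft hyphen
# (U+00AD), zero-width space (U+200B), and zero-width joiner (U+200D).
LETTERS_PATTERN = re.compile(r"[\w\u00AD\u200B\u200D]+")


def tokenize(text: str) -> List[str]:
    """Return every maximal run of characters matched by the pattern."""
    return LETTERS_PATTERN.findall(text)


# Hyphens, apostrophes, commas, and spaces all act as token boundaries.
tokens = tokenize("Ring-a-ring o' roses, a pocket full of posies!")
print(tokens)
# ['Ring', 'a', 'ring', 'o', 'roses', 'a', 'pocket', 'full', 'of', 'posies']
```

Note that an ordinary hyphen splits a word into separate tokens, whereas a soft hyphen does not, because only the latter is part of the character class.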
In our <vocabulary-key> we could have also included a <reuse-type>, but we have
intentionally omitted it here, because our file's <body> opens with <body
bitext-relation="B-descends-from-A" reuse-type="general_adaptation">. The value for
@reuse-type, general_adaptation, corresponds to a <name> in a standard TAN vocabulary
item for reuse types. We don't need to invoke a <reuse-type> in the <vocabulary-key>
because we have opted not to give it an @xml:id. Notice that general_adaptation has an
underscore instead of a space. That's because @reuse-type can take multiple values,
which are separated by spaces. We could have used a hyphen instead of an underscore, if
we preferred. The values of <name> are never case-sensitive, and the space, hyphen, and
underscore are treated as equivalent.
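The name-matching rule just described can be sketched in Python. This is a minimal illustration under assumptions, not TAN's actual normalization routine: the function names are hypothetical, and the second value in the usage example is invented to show how a multi-valued attribute would be split.

```python
from typing import List


def normalize_name(name: str) -> str:
    """Fold case and treat hyphen and underscore as equivalent to a space."""
    return name.lower().replace("-", " ").replace("_", " ")


def attribute_values(attr: str) -> List[str]:
    """Split a space-separated attribute value, then normalize each item."""
    return [normalize_name(value) for value in attr.split()]


# All three spellings resolve to the same vocabulary name.
assert normalize_name("general_adaptation") == normalize_name("General Adaptation")
assert normalize_name("general-adaptation") == normalize_name("general adaptation")

# "Some-Other" is a made-up second value, used only to show the splitting.
values = attribute_values("general_adaptation Some-Other")
```

Splitting on spaces before normalizing is what makes the underscore (or hyphen) necessary inside an attribute: a literal space there would be read as a boundary between two values.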