A TAN file is validated when it, along with its associated TAN schemas, is passed
to a validation engine. Validation can be set up either by pointing explicitly to
the schemas from within a TAN file (via <?xml-model ?> statements in the prolog), or
by setting up an oXygen project or framework to apply the schemas to TAN files
automatically (see the section called “Local Setup”). There are two types of TAN
validation.
Structural validation is conducted through RELAX-NG files that define the
attributes, elements, and patterns that are allowed or required in a given TAN
format. These files are kept in the schemas project subdirectory. If you are editing
a TAN-T file, for example, its RELAX-NG schema is schemas/TAN-T.rnc. The RELAX-NG
files are written principally in the compact syntax (.rnc), then converted to XML
syntax (.rng).
The TAN-TEI format is an exception. The schema begins with schemas/TAN-TEI.odd.
This file, linked as it is with the other RELAX-NG files, is processed by TEI
stylesheets to generate the master TAN-TEI.rnc and TAN-TEI.rng files that validate
TAN-TEI files. The ODD file is processed against TEI All, the largest of the TEI
formats, in the version available at the time of the release of a given TAN version.
The second type of validation uses Schematron to check rules that cannot be
expressed in RELAX-NG, e.g., that no @when should have a date in the future. More
than one hundred types of errors are checked during Schematron validation. For a
comprehensive list see ../functions/errors/TAN-errors.xml. Some of these errors can
be quite time-consuming for a computer to check. For example, if a class-1 file has a
<redivision>, the two transcriptions should be identical. On short texts, the test
can be made in seconds; on longer texts it might take minutes. Therefore Schematron
validation allows three different levels: terse, normal, and verbose. The names
reflect not only how long each phase takes but also how much feedback is provided.
The Schematron files themselves are rather small. The majority of the work is done by a large library of XSLT code that takes the file, resolves it, and expands it, inserting errors and help messages along the way. A greatly reduced version of the expanded file is then passed back to the Schematron processor as a global variable. The Schematron processor returns as messages any errors or warnings found in the generated file, and for any suggested corrections (also embedded as children), it returns a Schematron Quick Fix.
TAN's Schematron validation is more computationally intensive than its RELAX-NG
validation. The longer and more complex your file and its dependencies, the longer
validation will take. Files such as the Ring-a-roses examples in the examples
subdirectory will take a split second to validate, but a TAN-T file of the Old
Testament of the King James Version has been known to take about 33 seconds to
validate in the normal phase (the whole Bible about a minute). A TAN-A-lm file with a
full morphological analysis of that long TAN-T file will take a long time to
validate.
Tests were performed on a TAN-A file that had three very large TAN-T sources (each about 1.6 MB and 8,100 elements). When the TAN-A file had 125 claims, Schematron validation under the normal phase took about 13 seconds (run on oXygen 22.1 on a Windows 10 laptop with an Intel i5-8250U @ 1.60GHz). When the number of claims was expanded to 546, the same process took 63 seconds. When the file had 5,421 claims, it took 78 minutes, 45 seconds to validate.
Note: Much of the increase in validation time is due to the Schematron process itself. The XSLT component of the three tests above took 8.3 seconds, 27.1 seconds, and 23 minutes 57 seconds, respectively. The time taken by the Schematron component grows faster than the time taken by the XSLT component.
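As a worked check on the figures above, the following Python sketch computes what share of each run was spent outside the XSLT component. It assumes that everything not attributed to XSLT belongs to the Schematron component, and it treats the reported totals as exact even though the first is given only as "about 13 seconds". The names RUNS and non_xslt_share are illustrative.

```python
# Figures quoted in the text: (claims, total seconds, XSLT seconds).
RUNS = [
    (125, 13.0, 8.3),
    (546, 63.0, 27.1),
    (5421, 78 * 60 + 45, 23 * 60 + 57),
]


def non_xslt_share(total_s, xslt_s):
    """Fraction of total validation time not spent in the XSLT component."""
    return (total_s - xslt_s) / total_s


for claims, total_s, xslt_s in RUNS:
    share = non_xslt_share(total_s, xslt_s)
    print(f"{claims:>5} claims: share outside XSLT = {share:.0%}")
```

Under those assumptions the share outside XSLT rises from a little over a third of the run at 125 claims to about 70% at 5,421 claims, which is consistent with the observation in the note.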
In future versions of TAN this process will be further optimized (the figures
above are a very significant improvement over 2018 figures). For now, you must make
decisions that pit speed against convenience. If you wish to have validation happen
quickly, break files into smaller ones, perhaps to be joined later in a single TAN
file via <inclusion>s. Validating ten component files, each with ten thousand
elements, will take less time in aggregate than validating one long file with one
hundred thousand elements. Had the example TAN-A file mentioned above been split into
43 different files, the entire collection would have been validated in less than 12%
of the time.
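The 12% figure can be reproduced with simple arithmetic. The Python sketch below assumes that each of the 43 parts holds about 126 claims and validates in roughly the 13 seconds measured for the 125-claim file. That per-part time is an extrapolation from the tests above, not a separate measurement, and the constant and function names are illustrative.

```python
import math

TOTAL_CLAIMS = 5421
PARTS = 43
WHOLE_FILE_SECONDS = 78 * 60 + 45  # 4,725 s for the unsplit file
SECONDS_PER_PART = 13              # assumed: time measured for about 125 claims


def split_cost(parts, seconds_per_part):
    """Total time to validate every part, one after another."""
    return parts * seconds_per_part


claims_per_part = math.ceil(TOTAL_CLAIMS / PARTS)
split_seconds = split_cost(PARTS, SECONDS_PER_PART)
ratio = split_seconds / WHOLE_FILE_SECONDS

print(f"claims per part (at most): {claims_per_part}")
print(f"split total: {split_seconds} s, whole file: {WHOLE_FILE_SECONDS} s")
print(f"split / whole = {ratio:.1%}")
```

With these inputs the 43 parts take 559 seconds in total, about 11.8% of the 4,725 seconds needed for the single file.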
The process behind Schematron validation can be used not only for validation but also for other applications, so it merits explanation. Any TAN file that is processed by the TAN XSLT library goes through two major transformations.
The first transformation resolves the file. The goal is to get the file into a
state where it can be evaluated without having to consult any <vocabulary> or
<inclusion> dependencies. (See the section called “Networked Files” for background
on TAN's approach to inclusion.) This process also does some basic file-specific
normalization; it will:
1. Prepare the file. This includes evaluating <alias>, stamping the root element
   with a base URI (the path location of the file), and stamping every element with
   a @q (an arbitrary name), which contains a unique identifier. This identifier is
   used by the Schematron file to match an element with any error messages in the
   corresponding element in the XSLT output.
2. Identify those nodes that need to be changed by <vocabulary> or <inclusion>
   dependencies.
3. Insert required components from <vocabulary>s or <inclusion>s through the
   following method:
   a. Relevant external vocabulary items are inserted into the <head>, either as
      descendants of the appropriate <vocabulary> or, if derived from TAN standard
      vocabulary, as new <tan-vocabulary> elements immediately following the
      <vocabulary-key>. All vocabulary items are imprinted with an <id>, to
      facilitate rapid retrieval of vocabulary. Any vocabulary <name> that is not
      normalized is given a copy that is name-normalized (signaled by @norm):
      lower-case, hyphens and underscores changed to spaces, and space-normalized.
   b. Any element with @include is replaced by the elements of the same name found
      in the target inclusion document. In addition, <inclusion> is populated with
      any vocabulary items required to resolve the newly included material
      (recursively, if that inclusion requires other inclusions). This last point is
      important, because all IDrefs must be interpreted in light of the original
      context. IDrefs are brought into the host document, so when you use
      <inclusion> you must ensure there are no id conflicts.
4. Normalize all numbers in original components (i.e., excluding included elements or vocabulary items) as Arabic numerals.
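The name normalization mentioned in the steps above (lower-case, hyphens and underscores changed to spaces, then space-normalized) can be illustrated with a short Python sketch. This is not the TAN library's own function. The name normalize_name is hypothetical, the order of operations simply follows the prose, and "space-normalized" is assumed to mean what XPath's normalize-space() does: trim the ends and collapse runs of spaces, tabs, and line breaks.

```python
import re


def normalize_name(name):
    """Lower-case a name, turn hyphens and underscores into spaces,
    then trim and collapse whitespace."""
    lowered = name.lower()
    spaced = re.sub(r"[-_]", " ", lowered)
    # Collapse runs of space, tab, CR, and LF, then trim both ends.
    return re.sub(r"[ \t\r\n]+", " ", spaced).strip(" ")


for raw in ["Greek_New-Testament", "  Old   Latin "]:
    print(repr(raw), "->", repr(normalize_name(raw)))
```

Two names that differ only in case, hyphenation, or spacing come out identical under this kind of normalization, which is presumably why a normalized copy makes vocabulary easier to match.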
Files are resolved recursively. That is, no <vocabulary> or <inclusion> components
are imported until the files pointed to are themselves first resolved.
Numerals fall at the end of the process because they might need to be resolved in light of resolved vocabulary and inclusions.
The description above is necessarily generalized. For details consult the function library, particularly ../functions/incl/TAN-core-resolve-functions.xsl. In cases where there is a conflict between the code and the description above, the code is to be interpreted as more current and authoritative.
The second transformation expands the file. The goal is to unpack the components of a resolved document and identify any errors along the way (see the master list of errors). There are three levels of expansion, corresponding to the three levels of Schematron validation: terse, normal, and verbose.
In terse expansion, for each value of an attribute, an element with the
attribute's name is placed within the parent (e.g., @type="a b" produces
<type>a</type> and <type>b</type>). If the value is an IDref, and it points to an
alias, a copy is made for the IDref of each target vocabulary item. If an id
reference does not point to a vocabulary item of the expected type, an error message
is also copied in the parent. Any values that are ranges are expanded, if need be.
Select networked files are checked for basic validity. Class-2 files include a
special set of rounds during terse validation, where their sources are adjusted, and
then checked against specific references made in the class-2 file. (See the section
called “Class 2 Pointer Syntax: Referencing Texts”.) In terse expansion, all pointing
mechanisms are checked, to make sure they point to a valid location. Because of this
basic requirement, some terse expansion can take a long time on lengthy files, or
ones with complex <adjustments>.
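The first step of terse expansion, turning attribute values into child elements, can be sketched in Python with the standard library's xml.etree.ElementTree. This is an illustration, not the TAN XSLT itself. It assumes that attribute values are split on whitespace and that the original attribute is kept on the parent; the function name expand_attributes is hypothetical, and IDref resolution, range expansion, and error messages are left out.

```python
import xml.etree.ElementTree as ET


def expand_attributes(element):
    """Return a copy of `element` with one child element per attribute value."""
    expanded = ET.Element(element.tag, dict(element.attrib))
    expanded.text = element.text
    expanded.tail = element.tail
    # One new child per whitespace-separated token, named after the attribute.
    for name, value in element.attrib.items():
        for token in value.split():
            ET.SubElement(expanded, name).text = token
    # The original children follow, expanded in the same way.
    for child in element:
        expanded.append(expand_attributes(child))
    return expanded


source = ET.fromstring('<div type="a b"/>')
result = expand_attributes(source)
print(ET.tostring(result, encoding="unicode"))
```

For the input above, the serialized result contains <type>a</type> and <type>b</type> as children of <div>, matching the example in the text.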
Normal expansion builds on terse expansion by interrogating networked files more closely. Any errors that were reported during the terse stage but were suppressed to avoid clutter are enabled.
Verbose expansion generally attends to procedures that are complex, or are not
critical to validation. For example, a <model> of a class-1 file will be checked, to
find references that one file has but the other lacks. A class-1 <redivision> will be
analyzed, to make sure that the two transcriptions are identical. A catalog file in
the same directory will be checked, to see if it has faulty entries.
Many errors lend themselves to solutions that can be recommended by the TAN function library. Some solutions are returned to the Schematron validation method as Schematron Quick Fixes (SQFs). XML editors that are equipped to handle SQFs (e.g. oXygen XML Editor) can then prompt users to fix an errant section with a quick replacement. For example, if text has not been NFC Unicode-normalized, an SQF will allow a user to make the change in two clicks. Thus, TAN validation does not merely tell you what the problems are; it tries to help fix them.
The term "expansion" describes the process but possibly not the output. If the
global parameter $is-validation is true, then in the course of expanding the file
the TAN templates will abandon any parts that are no longer needed. The output is
normally much smaller than the input file, restricted as it is to the root element
and elements that have been marked with errors, warnings, or fixes. So although
during validation the file is really being expanded, at the end only a small portion
of the expanded file is returned to the Schematron processor, to expedite
validation. But if $is-validation is false (the default value, if the file is not
being validated), the entire expanded file and its dependencies are returned. Such
output can be very useful in applications.
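The effect of $is-validation on the output can be pictured with the following Python sketch. It is only a rough model of the behavior described above, not the library's templates: the marker element names (error, warning, fix), the helper names, and the sample document are all hypothetical, and the real output format may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical marker element names; the real library may use others.
MARKERS = {"error", "warning", "fix"}


def marked_copy(element):
    """Copy an element's tag and attributes, keeping only its marker children."""
    copy = ET.Element(element.tag, dict(element.attrib))
    for child in element:
        if child.tag in MARKERS:
            copy.append(child)
    return copy


def reduce_for_validation(root, is_validation):
    """Return the whole tree, or only the root plus the marked elements."""
    if not is_validation:
        return root
    reduced = marked_copy(root)
    for element in root.iter():
        if element is root or element.tag in MARKERS:
            continue
        if any(child.tag in MARKERS for child in element):
            reduced.append(marked_copy(element))
    return reduced


expanded = ET.fromstring(
    '<TAN-T q="r1"><body q="b1">'
    '<div q="d1">fine</div>'
    '<div q="d2">text<error>not NFC-normalized</error></div>'
    "</body></TAN-T>"
)
reduced = reduce_for_validation(expanded, is_validation=True)
print(ET.tostring(reduced, encoding="unicode"))
```

In this model the reduced tree keeps each marked element's @q, so a consumer could still match a message to the corresponding element in the original file, while unmarked content such as the first <div> is dropped.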
The description above of file expansion is necessarily generalized. For details consult the function library, particularly ../functions/incl/TAN-core-expand-functions.xsl.
The validation rules have been tested not only on the files in the examples
subdirectory, but more importantly upon the files in functions/errors. The files
there attempt to provide at least one example of every error, and they are validated
in reverse: a file is valid if and only if every error has a corresponding comment
signaling the error.