A TAN file is validated when it is passed, along with its associated TAN schemas,
to a validation engine. Validation can be set up either by pointing
explicitly to the schemas within a TAN file (via <?xml-model ?>
statements in the prolog), or by setting up an Oxygen project or framework to
automatically apply the schemas to TAN files (see the section called “Installation and local setup”). There are two types of TAN validation.
First, the file structure is checked against RELAX-NG files that define the
attributes, elements, and patterns that are allowed or required in a given TAN
format. These files are kept in the schemas project subdirectory,
according to format name. If you are editing a TAN-T file, for example, its
RELAX-NG schema is schemas/TAN-T.rnc.[20]
The second type of validation uses Schematron to apply rules that cannot be
expressed in RELAX-NG, e.g., no @when
should have a date in the future. More than one hundred
types of errors are checked during Schematron validation. For a comprehensive list
see ../functions/errors/TAN-errors.xml and Chapter 14, Errors. Some of these errors can be quite time-consuming for a
computer to check. For example, if a class-1 file has a <redivision>, the text
should be identical. On short texts, the comparison can be made in seconds; on
longer ones it might take minutes (see next section, on efficiency). Therefore
Schematron validation offers three phases: terse, normal, and verbose.
The names reflect not only how long each phase takes but also how much feedback
it provides.
The Schematron files themselves are rather small. The majority of the work is done by the TAN function library, which takes the file, resolves it, and expands it, inserting errors and help messages along the way. A greatly reduced version of the expanded file, containing only warnings and errors, is then passed back to the Schematron processor as a global variable. The Schematron processor returns as messages any errors or warnings found in the generated file, and any suggested corrections as Schematron Quick Fixes.
For more details about the TAN validation process, see the section called “The mechanics of validation”.
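As a rough illustration of the first setup route, the sketch below uses Python's standard library to list the <?xml-model ?> statements in the prolog of a small stand-in file. Only the path schemas/TAN-T.rnc comes from the text above. The Schematron path schemas/TAN-T.sch, the phase value, and the empty <TAN-T/> body are assumptions made for the example, not taken from a real TAN file.

```python
import re
from xml.dom import minidom

# Stand-in document. The .sch path and the phase value are assumptions
# for illustration; only schemas/TAN-T.rnc is named in the guidelines.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="schemas/TAN-T.sch" phase="terse" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-T/>
"""


def xml_model_statements(xml_bytes):
    """Return one dict of pseudo-attributes per xml-model statement in the prolog."""
    doc = minidom.parseString(xml_bytes)
    found = []
    for node in doc.childNodes:
        # Stop at the root element: only the prolog is of interest here.
        if node.nodeType == node.ELEMENT_NODE:
            break
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-model"):
            # Handles double-quoted pseudo-attributes only.
            pairs = re.findall(r'([\w-]+)\s*=\s*"([^"]*)"', node.data)
            found.append(dict(pairs))
    return found


models = xml_model_statements(SAMPLE.encode("utf-8"))
for model in models:
    print(model)
```

A validation engine that honors such statements would apply each referenced schema in turn. Files validated through an Oxygen project or framework would not need the statements at all.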
TAN's Schematron validation specifies a process that is much more
computationally intensive than is its RELAX-NG counterpart. The longer and more
complex your TAN file and its dependencies, the longer it will take to validate.
Files such as the Ring-a-roses examples in the examples subdirectory
will take a split second to validate, but a TAN-T file of the Old Testament of the
King James Version has been known to take about 25 seconds to validate in the
normal phase, and the whole Bible, about a minute. A TAN-A-lm file with a full
morphological analysis of a very long TAN-T file will take much longer to
validate.
Tests were performed on a TAN-A file that had three very large TAN-T sources (each about 1.6 MB and 8,100 elements). When the TAN-A file had 125 claims, Schematron validation under the normal phase took about 13 seconds (run in Oxygen 22.1 on a Windows 10 laptop with an Intel i5-8250U @ 1.60GHz). When the number of claims was expanded to 546, the same process took 63 seconds. When the file had 5,421 claims, it took 78 minutes, 45 seconds to validate.[21]
The figures above are a very significant improvement over the time required in
the 2018 version, and no doubt future versions of TAN will bring optimizations to
the validation process. Nevertheless, you may need to make decisions that pit
speed against convenience. If you want validation to be quick, break files into
smaller ones, perhaps to be joined later in a single TAN file via <inclusion>s. Validating ten
component files each with ten thousand elements will take less time in aggregate
than validating one long file with one hundred thousand elements. Had the example
TAN-A file mentioned above been split into 43 different files, the time required
for validating the entire collection would have been reduced by 88%.
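The arithmetic behind a saving of that size can be sketched as follows. This is one plausible reconstruction, not necessarily how the figure was derived: it assumes that each of the 43 parts would hold about 126 claims and would validate in roughly the 13 seconds measured for the 125-claim file.

```python
# Timings reported above for the normal phase.
SECONDS_125_CLAIMS = 13              # one file with 125 claims
SECONDS_5421_CLAIMS = 78 * 60 + 45   # one file with 5,421 claims
PARTS = 43

claims_per_part = 5421 / PARTS       # about 126 claims per file

# Assumption: each part costs about as much as the measured 125-claim file.
split_total = PARTS * SECONDS_125_CLAIMS
reduction = 1 - split_total / SECONDS_5421_CLAIMS

print(f"{claims_per_part:.0f} claims per part")
print(f"{split_total} s split vs {SECONDS_5421_CLAIMS} s whole")
print(f"reduction: {reduction:.0%}")
```

Under those assumptions the collection takes 559 seconds instead of 4,725, a reduction of about 88%.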
[20] The RELAX-NG files are written principally in the compact syntax (.rnc), then converted to XML syntax (.rng).
The TAN-TEI format is an exception. Behind the schema schemas/TAN-TEI.rnc is a master file, schemas/TAN-TEI.odd. This file, linked as it is with the other RELAX-NG files, is processed by TEI stylesheets to generate the master TAN-TEI.rnc and TAN-TEI.rng files that validate TAN-TEI files. The ODD file is processed against TEI All, the largest of the TEI formats, in the version available at the time of the release of a given TAN version.
[21] Much of the extra time is due to the Schematron evaluation process, not the preparatory work performed by the TAN function library. The library component of the three tests above took up 8.3 seconds, 27.1 seconds, and 23 minutes 57 seconds, respectively. The time complexity of the Schematron component grows faster than does that of the XSLT.
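The split between the two components can be worked out from the reported figures. The sketch below assumes that whatever time remains after the library component is attributable to Schematron evaluation, and it treats the rounded totals ("about 13 seconds") as exact, so the results are approximate.

```python
# (claims, total seconds, library seconds) for the three tests reported above.
TESTS = [
    (125, 13.0, 8.3),
    (546, 63.0, 27.1),
    (5421, 78 * 60 + 45, 23 * 60 + 57),
]

# Assumption: total time minus library time is the Schematron share.
rows = [(claims, library, total - library) for claims, total, library in TESTS]

for claims, library, schematron in rows:
    print(f"{claims:>5} claims: library {library:7.1f} s, "
          f"Schematron {schematron:7.1f} s")

# Growth of each component from the smallest test to the largest.
library_growth = rows[-1][1] / rows[0][1]
schematron_growth = rows[-1][2] / rows[0][2]
print(f"library x{library_growth:.0f}, Schematron x{schematron_growth:.0f}")
```

On these numbers the library component grows by a factor of roughly 170 between the smallest and largest tests, and the Schematron component by a factor of roughly 700.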