TAN validation

The process

A TAN file is validated when it is passed, along with its associated TAN schemas, to a validation engine. Validation can be set up either by pointing explicitly to the schemas within a TAN file (via <?xml-model ?> statements in the prolog), or by setting up an Oxygen project or framework to automatically apply the schemas to TAN files (see the section called “Installation and local setup”). There are two types of TAN validation.

First, the file structure is checked against RELAX-NG files that define the attributes, elements, and patterns that are allowed or required in a given TAN format. These files are kept in the project's schemas subdirectory and are named after the format. If you are editing a TAN-T file, for example, its RELAX-NG schema is schemas/TAN-T.rnc.^[20]

The second type of validation uses Schematron to apply rules that cannot be expressed in RELAX-NG, e.g., no @when should have a date in the future. More than one hundred types of errors are checked during Schematron validation. For a comprehensive list see ../functions/errors/TAN-errors.xml and Chapter 14, Errors. Some of these errors can be quite time-consuming for a computer to check. For example, if a class-1 file has a <redivision>, its text should be identical to that of the file the <redivision> points to. On short texts, the comparison can be made in seconds; on longer ones it might take minutes (see next section, on efficiency). Therefore Schematron validation allows three different phases: terse, normal, and verbose. The names reflect not only how long each phase takes but also how much feedback is provided.

The Schematron files themselves are rather small. The majority of the work is done by the TAN function library, which takes the file, resolves it, and expands it, inserting errors and help messages along the way. A greatly reduced version of the expanded file, containing only warnings and errors, is then passed back to the Schematron processor as a global variable. The Schematron processor returns as messages any errors or warnings found in the generated file, and any suggested corrections as Schematron Quick Fixes.

For more details about the TAN validation process, see the section called “The mechanics of validation”.
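As a rough illustration of the first setup option described above (pointing to the schemas from the prolog), the following Python sketch builds the two <?xml-model ?> statements for a given TAN format. The helper name xml_model_prolog, the relative schemas path, and the Schematron file name (the format name with a .sch extension) are assumptions made for the example; only schemas/TAN-T.rnc is named in this section, so adjust the paths to your own installation.

```python
def xml_model_prolog(tan_format: str, schema_dir: str = "schemas") -> str:
    """Return two xml-model processing instructions for one TAN format.

    Assumes (hypothetically) that the Schematron file sits next to the
    RELAX-NG file and is called <format>.sch.
    """
    # RELAX-NG compact syntax schema: checks the file structure
    rnc = (
        f'<?xml-model href="{schema_dir}/{tan_format}.rnc" '
        'type="application/relax-ng-compact-syntax"?>'
    )
    # Schematron schema: applies the rules RELAX-NG cannot express
    sch = (
        f'<?xml-model href="{schema_dir}/{tan_format}.sch" '
        'type="application/xml" '
        'schematypens="http://purl.oclc.org/dsdl/schematron"?>'
    )
    return "\n".join([rnc, sch])


prolog = xml_model_prolog("TAN-T")
print(prolog)
```

The printed lines would be pasted above the root element of a TAN-T file. The href values are resolved relative to the file being validated, so the schema_dir argument normally needs to reflect where that file sits in relation to the schemas subdirectory.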

Efficiency

TAN's Schematron validation specifies a process that is much more computationally intensive than is its RELAX-NG counterpart. The longer and more complex your TAN file and its dependencies, the longer it will take to validate. Files such as the Ring-a-roses examples in the examples subdirectory will take a split second to validate, but a TAN-T file of the Old Testament of the King James Version has been known to take about 25 seconds to validate in the normal phase, and the whole Bible, about a minute. A TAN-A-lm file with a full morphological analysis of a very long TAN-T file will take a much longer time to validate.

Tests were performed on a TAN-A file that had three very large TAN-T sources (each about 1.6 MB and 8,100 elements). When the TAN-A file had 125 claims, Schematron validation under the normal phase took about 13 seconds (run in Oxygen 22.1 on a Windows 10 laptop with an Intel i5-8250U @ 1.60GHz). When the number of claims was expanded to 546, the same process took 63 seconds. When the file had 5,421 claims, it took 78 minutes, 45 seconds to validate.^[21]
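The three timings above suggest that validation time grows faster than the number of claims. One rough way to see this is to estimate the apparent scaling exponent between consecutive tests. The Python sketch below does so; the names are illustrative, and three data points from a single machine support only a rough estimate, not a precise model of the validation process.

```python
import math

# (claims, seconds) for the three tests reported above, normal phase
tests = [(125, 13.0), (546, 63.0), (5421, 78 * 60 + 45)]


def apparent_exponent(a, b):
    """Exponent k such that time scales as claims**k between tests a and b."""
    (n1, t1), (n2, t2) = a, b
    return math.log(t2 / t1) / math.log(n2 / n1)


# One exponent per pair of consecutive tests
exponents = [apparent_exponent(a, b) for a, b in zip(tests, tests[1:])]

for (n1, _), (n2, _), k in zip(tests, tests[1:], exponents):
    print(f"{n1} -> {n2} claims: time grows roughly as claims^{k:.2f}")
```

An exponent near 1 means roughly linear growth. The first step comes out only slightly above 1, while the second comes out close to 2, which is consistent with the advice to keep individual files small.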

The figures above are a very significant improvement over the time required in the 2018 version, and future versions of TAN are likely to bring further optimizations to the validation process. Nevertheless, you may need to make decisions that pit speed against convenience. If you want validation to be quick, break files into smaller ones, perhaps to be joined later in a single TAN file via <inclusion>s. Validating ten component files each with ten thousand elements will take less time in aggregate than validating one long file with one hundred thousand elements. Had the example TAN-A file mentioned above been split into 43 different files (of roughly 125 claims each, at about 13 seconds apiece), the time required for validating the entire collection would have been reduced by 88%.
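The 88% figure can be checked against the timings reported earlier. The Python sketch below assumes that each of the 43 component files would hold about 125 claims and would validate in about 13 seconds, like the smallest test, and it ignores any per-file overhead; the variable names are illustrative.

```python
# Timings reported above for the normal phase
whole_file_seconds = 78 * 60 + 45   # 5,421 claims in one file
small_file_seconds = 13             # one file of about 125 claims
parts = 43                          # assumed number of component files

# Total time if every component file validates like the 125-claim test
split_total_seconds = parts * small_file_seconds
saving = 1 - split_total_seconds / whole_file_seconds

print(f"one file: {whole_file_seconds} s; {parts} files: {split_total_seconds} s")
print(f"time saved: {saving:.0%}")
```

Under these assumptions the split collection takes 559 seconds in total, against 4,725 seconds for the single file.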

^[20]The RELAX-NG files are written principally in the compact syntax (.rnc), then converted to XML syntax (.rng). The TAN-TEI format is an exception. Behind the schema schemas/TAN-TEI.rnc is a master file schemas/TAN-TEI.odd. This file, linked as it is with the other RELAX-NG files, is processed by TEI stylesheets to generate the master TAN-TEI.rnc and TAN-TEI.rng files that validate TAN-TEI files. The ODD file is processed against TEI All, the largest of the TEI formats, in the version available at the time of the release of a given TAN version.

^[21]Much of the extra time is due to the Schematron evaluation process, not the preparatory work performed by the TAN function library. The library component of the three tests above took up 8.3 seconds, 27.1 seconds, and 23 minutes 57 seconds, respectively. The time complexity of the Schematron component grows faster than does that of the XSLT.
