The TAN Validation Process

A TAN file is validated when it, along with its associated TAN schemas, is passed to a validation engine. Validation can be set up either by pointing explicitly to the schemas within a TAN file (via <?xml-model ?> statements in the prolog), or by setting up an oXygen project or framework to automatically apply the schemas to TAN files (see the section called “Local Setup”). There are two types of TAN validation.

Structural validation is conducted through RELAX-NG files that define the attributes, elements, and patterns that are allowed or required in a given TAN format. These files are kept in the schemas project subdirectory. If you are editing a TAN-T file, for example, its RELAX-NG schema is schemas/TAN-T.rnc. The RELAX-NG files are written principally in the compact syntax (.rnc), then converted to XML syntax (.rng). The TAN-TEI format is an exception. The schema begins with schemas/TAN-TEI.odd. This file, linked as it is with the other RELAX-NG files, is processed by TEI stylesheets to generate the master TAN-TEI.rnc and TAN-TEI.rng files that validate TAN-TEI files. The ODD file is processed against TEI All, the largest of the TEI formats, in the version available at the time of the release of a given TAN version.

The second type of validation uses Schematron to check rules that cannot be expressed in RELAX-NG, e.g., no @when should have a date in the future. More than one hundred types of errors are checked during Schematron validation. For a comprehensive list see ../functions/errors/TAN-errors.xml. Some of these errors can be quite time-consuming for a computer to check. For example, if a class-1 file has a <redivision>, the two transcriptions should be identical. On short texts, the test can be made in seconds; on longer texts it might take minutes. Therefore Schematron validation allows three different levels: terse, normal, and verbose. The names reflect not only how long each phase takes but also how much feedback is provided.

The Schematron files themselves are rather small. The majority of the work is done by a large library of XSLT code that takes the file, resolves it, and expands it, inserting errors and help messages along the way. A greatly reduced version of the expanded file is then passed back to the Schematron processor as a global variable. The Schematron processor returns as messages any errors or warnings found in the generated file, and for any suggested corrections (also embedded as children), it returns a Schematron Quick Fix.

TAN's Schematron validation is more computationally intensive than its RELAX-NG validation. The longer and more complex your file and its dependencies, the longer validation will take. Files such as the Ring-a-roses examples in the examples subdirectory will take a split second to validate, but a TAN-T file of the Old Testament of the King James Version has been known to take about 33 seconds to validate in the normal phase (the whole Bible about a minute). A TAN-A-lm file with a full morphological analysis of that long TAN-T file will take a long time to validate.

Tests were performed on a TAN-A file that had three very large TAN-T sources (each about 1.6 MB and 8,100 elements). When the TAN-A file had 125 claims, Schematron validation under the normal phase took about 13 seconds (run on oXygen 22.1 on a Windows 10 laptop with an Intel i5-8250U @ 1.60GHz). When the number of claims was expanded to 546, the same process took 63 seconds. When the file had 5,421 claims, it took 78 minutes, 45 seconds to validate.

	Note
	Much of the increase in validation time is due to the Schematron process itself. The XSLT component of the three tests above took up 8.3 seconds, 27.1 seconds, and 23 minutes 57 seconds, respectively, leaving roughly 4.7 seconds, 35.9 seconds, and 54 minutes 48 seconds for the Schematron component. As files grow, the time taken by the Schematron component grows faster than that taken by the XSLT.

In future versions of TAN this process will be further optimized (the figures above are a very significant improvement over 2018 figures). For now, you must make decisions that trade speed against convenience. If you wish validation to happen quickly, break files into smaller ones, perhaps to be joined later in a single TAN file via <inclusion>s. Validating ten component files, each with ten thousand elements, will take less time in aggregate than validating one long file with one hundred thousand elements. Had the example TAN-A file mentioned above been split into 43 different files, the entire collection would have been validated in less than 12% of the time.
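
The 12% figure can be checked with a little arithmetic. The Python sketch below is not part of TAN. It assumes that each of the 43 parts, at about 126 claims, would validate in roughly the 13 seconds measured for the 125-claim file, and that every part draws on the same three sources.

```python
# Timings reported above for the normal phase, in seconds.
SMALL_FILE_SECONDS = 13             # TAN-A file with 125 claims
LARGE_FILE_SECONDS = 78 * 60 + 45   # TAN-A file with 5,421 claims

total_claims = 5421
parts = 43

# Each part would hold about as many claims as the 125-claim test file.
claims_per_part = total_claims / parts

# Assumption: each part costs what the 125-claim file did.
split_seconds = parts * SMALL_FILE_SECONDS
ratio = split_seconds / LARGE_FILE_SECONDS

print(f"{claims_per_part:.0f} claims per part")
print(f"{split_seconds} s split vs {LARGE_FILE_SECONDS} s whole")
print(f"{ratio:.1%} of the original time")
```

Under these assumptions the split collection takes 559 seconds against 4,725 seconds for the single file, or about 11.8%.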

The process behind Schematron validation can be used not only for validation but also for other applications, so it merits explanation. Any TAN file that is processed by the TAN XSLT library goes through two major transformations.

The first transformation resolves the file. The goal is to get the file into a state where it can be evaluated without having to consult any <vocabulary> or <inclusion> dependencies. (See the section called “Networked Files” for background on TAN's approach to inclusion.) This process also does some basic file-specific normalization; it will:

Prepare the file. This includes evaluating <alias>, stamping the root element with a base URI (the path location of the file), and stamping every element with a @q (an arbitrary name) that contains a unique identifier. This identifier is used by the Schematron file to match an element with any error messages in the corresponding element in the XSLT output.
Identify those nodes that need to be changed by <vocabulary> or <inclusion> dependencies.
Insert required components from <vocabulary>s or <inclusion>s through the following method:
1. Relevant external vocabulary items are inserted into the <head>, either as descendants of the appropriate <vocabulary> or if derived from TAN standard vocabulary as new <tan-vocabulary> elements immediately following the <vocabulary-key>. All vocabulary items are imprinted with an <id>, to facilitate rapid retrieval of vocabulary. Any vocabulary <name> that is not normalized is given a copy that is name-normalized (signaled by @norm): lower-case, hyphens and underscores changed to spaces, and space-normalized.
2. Any element with @include is replaced by the elements of the same name found in the target inclusion document. In addition, <inclusion> is populated with any vocabulary items required to resolve the newly included material (recursively, if that inclusion requires other inclusions). This last point is important, because all IDrefs must be interpreted in light of the original context. IDrefs are brought into the host document, so when you use <inclusion> you must ensure there are no id conflicts.
Normalize all numbers in original components (i.e., excluding included elements or vocabulary items) as Arabic numerals.
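
The name normalization described in sub-step 1 (lower-case, hyphens and underscores changed to spaces, space-normalized) can be illustrated as follows. This is a Python sketch for orientation only, not TAN's code, which lives in the XSLT function library; the function name normalize_name is hypothetical.

```python
import re


def normalize_name(name: str) -> str:
    """Lower-case, turn hyphens and underscores into spaces, then space-normalize."""
    lowered = name.lower()
    spaced = re.sub(r"[-_]", " ", lowered)
    # str.split() with no argument splits on runs of whitespace and drops
    # leading and trailing whitespace, so joining with single spaces
    # collapses every run to one space.
    return " ".join(spaced.split())


print(normalize_name("  Old_Testament--Book "))
```

Two names that differ only in case, hyphenation, or spacing normalize to the same string, which is what allows them to be matched against one another.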

Files are resolved recursively. That is, no <vocabulary> or <inclusion> components are imported until the files pointed to are themselves first resolved.
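
Recursive resolution amounts to a depth-first traversal in which dependencies come first. The Python sketch below illustrates only that ordering. The function name, the file names, and the choice to raise an exception on a circular reference are all assumptions made for illustration, not a description of the TAN library.

```python
from typing import Dict, List, Set


def resolution_order(start: str, dependencies: Dict[str, List[str]]) -> List[str]:
    """Return file names in the order they would be resolved: dependencies first."""
    order: List[str] = []
    active: Set[str] = set()  # files whose resolution is still in progress

    def visit(name: str) -> None:
        if name in order:
            return  # already resolved
        if name in active:
            raise ValueError(f"circular dependency at {name}")
        active.add(name)
        for target in dependencies.get(name, []):
            visit(target)  # resolve what this file points to before the file itself
        active.discard(name)
        order.append(name)

    visit(start)
    return order


# Hypothetical files: a host that includes one file and uses a vocabulary,
# where the included file uses the same vocabulary.
deps = {
    "host.xml": ["incl.xml", "vocab.xml"],
    "incl.xml": ["vocab.xml"],
}
print(resolution_order("host.xml", deps))
```

The host file appears last in the returned list, after everything it points to, directly or indirectly.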

Numerals fall at the end of the process because they might need to be resolved in light of resolved vocabulary and inclusions.

The description above is necessarily generalized. For details consult the function library, particularly ../functions/incl/TAN-core-resolve-functions.xsl. In cases where there is a conflict between the code and the description above, the code is to be interpreted as more current and authoritative.

The second transformation expands the file. The goal is to unpack the components of a resolved document and identify any errors along the way (see the master list of errors). There are three levels of expansion, corresponding to the three levels of Schematron validation: terse, normal, and verbose.

In terse expansion, for each value of an attribute, an element with the attribute's name is placed within the parent (e.g., @type="a b" produces <type>a</type> and <type>b</type>). If the value is an IDref, and it points to an alias, a copy is made for the IDref of each target vocabulary item. If an id reference does not point to a vocabulary item of the expected type, an error message is also copied in the parent. Any values that are ranges are expanded, if need be. Select networked files are checked for basic validity. Class-2 files include a special set of rounds during terse validation, where their sources are adjusted, and then checked against specific references made in the class-2 file. (See the section called “Class 2 Pointer Syntax: Referencing Texts”.) In terse expansion, all pointing mechanisms are checked, to make sure they point to a valid location. Because of this basic requirement, some terse expansion can take a long time on lengthy files, or ones with complex <adjustments>.
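
The first step of terse expansion, in which each attribute value becomes a child element, can be pictured with the following Python sketch. It is a simplified illustration, not the TAN implementation: the function name is hypothetical, the original attribute is left in place (an assumption), and nothing is done about IDrefs, aliases, or ranges.

```python
import xml.etree.ElementTree as ET


def expand_attributes(element: ET.Element) -> None:
    """Add one child per whitespace-separated attribute value, named after the attribute."""
    for attr_name, attr_value in element.attrib.items():
        for token in attr_value.split():
            child = ET.SubElement(element, attr_name)
            child.text = token


div = ET.fromstring('<div type="a b"/>')
expand_attributes(div)
print(ET.tostring(div, encoding="unicode"))
```

After the call, the <div> holds two <type> children, one with the text a and one with the text b.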

Normal expansion builds on terse expansion by interrogating networked files more closely. Any errors that were reported during the terse stage but were suppressed to avoid clutter are enabled.

Verbose expansion generally attends to procedures that are complex, or are not critical to validation. For example, a <model> of a class-1 file will be checked, to find references that one has but the other lacks. A class-1 <redivision> will be analyzed, to make sure that the two transcriptions are identical. A catalog file in the same directory will be checked, to see if it has faulty entries.

Many errors lend themselves to solutions that can be recommended by the TAN function library. Some solutions are returned to the Schematron validation method as Schematron Quick Fixes (SQFs). XML editors that are equipped to handle SQFs (e.g. oXygen XML Editor) can then prompt users to fix an errant section with a quick replacement. For example, if text has not been NFC Unicode-normalized, an SQF will allow a user to make the change in two clicks. Thus, TAN validation does not merely tell you what the problems are; it tries to help fix them.
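
The NFC example shows why some errors come with a ready-made fix: the test that detects the problem also produces the replacement. A minimal Python sketch, using the standard unicodedata module (the function name nfc_fix is hypothetical, and this is not how TAN itself builds its SQFs):

```python
import unicodedata
from typing import Optional


def nfc_fix(text: str) -> Optional[str]:
    """Return the NFC form of `text` if it differs from `text`, else None."""
    normalized = unicodedata.normalize("NFC", text)
    return normalized if normalized != text else None


decomposed = "e\u0301"   # e followed by a combining acute accent
composed = "\u00e9"      # precomposed e with acute accent

print(nfc_fix(decomposed) == composed)
print(nfc_fix(composed) is None)
```

A return value of None means there is nothing to fix; any other value is the text that a quick fix would offer as the replacement.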

The term "expansion" describes the process but possibly not the output. If the global parameter $is-validation is true, then in the course of expanding the file the TAN templates will abandon any parts that are no longer needed. The output is normally much smaller than the input file, restricted as it is to the root element and elements that have been marked with errors, warnings, or fixes. So although during validation the file is really being expanded, at the end only a small portion of the expanded file is returned to the Schematron processor, to expedite validation. But if $is-validation is false (the default value, if the file is not being validated), the entire expanded file and its dependencies are returned. Such output can be very useful in applications.
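
The reduction that takes place during validation can be sketched as follows. This Python illustration makes several assumptions that are not drawn from the TAN library: that problems are flagged by child elements named error, warning, or fix; that the reduced output is flat; and that only the flags themselves, not the rest of an element's content, are carried over. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

MARKERS = {"error", "warning", "fix"}  # hypothetical marker element names


def prune_for_validation(root: ET.Element) -> ET.Element:
    """Keep the root plus shallow copies of elements that carry a marker child."""
    pruned = ET.Element(root.tag, root.attrib)
    for element in root.iter():
        flags = [child for child in element if child.tag in MARKERS]
        if not flags:
            continue
        # The root is already the container, so flag it directly
        # instead of nesting a copy of it inside itself.
        if element is root:
            target = pruned
        else:
            target = ET.SubElement(pruned, element.tag, element.attrib)
        target.extend(flags)
    return pruned


expanded = ET.fromstring(
    '<TAN-T id="x"><head/><body>'
    '<div q="d1"><error>bad</error></div>'
    '<div q="d2">ok</div>'
    '</body></TAN-T>'
)
reduced = prune_for_validation(expanded)
print(ET.tostring(reduced, encoding="unicode"))
```

Only the root and the flagged <div> survive; the @q on the surviving element is what lets a message be traced back to its place in the original file.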

The description above of file expansion is necessarily generalized. For details consult the function library, particularly ../functions/incl/TAN-core-expand-functions.xsl.

The validation rules have been tested not only on the files in the examples subdirectory, but more importantly upon the files in functions/errors. The files there attempt to provide at least one example of every error, and they are validated in reverse: a file is valid if and only if every error has a corresponding comment signaling the error.
