The mechanics of validation

In many cases, developers will want to work with TAN files, either as input or as output. But TAN files have a number of distinctive constructions: two different methods of inclusion (see the section called “Networked Files”), space-normalization rules (see the section called “Space characters and normalization”), numeration systems (see the section called “One reference system”), tokenization systems (see the section called “Defining words and tokens”), and pointing systems (see the section called “Class 2 pointer syntax: referencing texts”). You can work directly with raw TAN files, but if you bypass these constructions you run the risk of misinterpreting the files.

Every TAN file is definitively interpreted through the TAN functions that undergird the Schematron validation process (see the section called “TAN validation”). That process is a core part of the standard TAN utilities and applications, and it determines the nature of some of the more important global variables.

Every TAN file is subject to two major transformations, both for validation and for applications.

Resolution

The first transformation resolves the file. The goal is to get the file into a state where it can be understood on its own terms. A resolved TAN file contains all its relevant vocabulary and components. It can be evaluated without having to consult the files referred to by <vocabulary> or <inclusion> dependencies. (See the section called “Networked Files” for background on TAN's approach to inclusion.) This process also does some basic file-specific normalization; it will:

Prepare the file. This includes stamping the root element with a base URI (the path location of the file), evaluating <alias>, and inserting into every element a @q that contains an identifier unique to the element. The Schematron file uses this identifier to match each element with any error messages in the corresponding element of the XSLT output.
Insert required components from <vocabulary>s or <inclusion>s using the following method:
1. Relevant external vocabulary items are inserted into the <head>, either as descendants of the appropriate <vocabulary> or, if derived from TAN standard vocabulary, as new <tan-vocabulary> elements immediately following the <vocabulary-key>. To facilitate rapid retrieval of vocabulary, every vocabulary item is imprinted with an <id> corresponding to the @xml:id of any matching entry in <vocabulary-key>. Any vocabulary <name> that is not normalized is duplicated with a name-normalized copy (signaled by @norm): lower-case, hyphens and underscores changed to spaces, and space-normalized.
2. Any element with an @include is replaced by the elements of the same name found in the target inclusion document (constructed recursively if need be). In addition, <inclusion> (in the head) is populated with any vocabulary items required to resolve the newly included material (recursively, if need be). This last point is important, because all idrefs must be interpreted in light of the original context. Included idrefs are made available to the host document, so when you use <inclusion> you must ensure there are no id conflicts.
Normalize all numbers in original components (i.e., excluding included elements or vocabulary items) as Arabic numerals.
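
The name normalization described in step 1 can be sketched in a few lines of Python. This is only an illustration of the three stated rules, not the TAN function library's own code: the function names normalize_name and with_normalized_copy are hypothetical, and space normalization is assumed here to mean collapsing runs of space, tab, and newline characters and trimming the ends.

```python
import re


def normalize_name(name: str) -> str:
    """Apply the three stated rules: lower-case, hyphens and underscores
    to spaces, then space-normalize. (Hypothetical helper.)"""
    lowered = name.lower()
    spaced = re.sub(r"[-_]", " ", lowered)
    # Assumption: space normalization collapses runs of space, tab, CR, LF
    # into one space and trims the ends.
    return re.sub(r"[ \t\r\n]+", " ", spaced).strip(" ")


def with_normalized_copy(name: str) -> list:
    """Return the name alone if it is already normalized; otherwise return
    the original followed by its normalized copy (the copy that would
    carry @norm)."""
    norm = normalize_name(name)
    return [name] if norm == name else [name, norm]
```

A name that is already normalized yields a single entry, which mirrors the statement that only names that are not normalized are duplicated.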

Files are resolved recursively. That is, no <vocabulary> or <inclusion> components are incorporated or processed until the files pointed to are themselves first resolved.
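
The dependency-first order can be illustrated with a small Python sketch. It assumes a purely hypothetical in-memory model in which each file is a dictionary with its own components ("own") and the identifiers of the files it depends on ("deps"); the real library works on XML documents, and the circularity check is an addition made for this sketch.

```python
from typing import Dict, List, Optional, Set


def resolve(uri: str,
            files: Dict[str, dict],
            cache: Optional[Dict[str, List[str]]] = None,
            in_progress: Optional[Set[str]] = None) -> List[str]:
    """Return the components of a file after its dependencies have been
    resolved. `files` is a hypothetical stand-in for documents on disk."""
    cache = {} if cache is None else cache
    in_progress = set() if in_progress is None else in_progress
    if uri in cache:
        return cache[uri]
    if uri in in_progress:
        # Assumption of this sketch: circular references are an error.
        raise ValueError(f"circular dependency at {uri}")
    in_progress.add(uri)
    components = list(files[uri]["own"])
    for dep in files[uri]["deps"]:
        # Nothing is taken from a dependency until it is itself resolved.
        components.extend(resolve(dep, files, cache, in_progress))
    in_progress.discard(uri)
    cache[uri] = components
    return components
```

The cache means that a file referred to by several others is resolved only once.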

Numerals fall at the end of the process because they might need to be resolved in light of resolved vocabulary and inclusions.
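
As a rough illustration of what converting a number to Arabic numerals involves, the following Python sketch handles a single case, Roman numerals. The function names are hypothetical, and the sketch makes assumptions the text above does not state: it ignores every other numeration system, does not check that a Roman numeral is well formed, and always reads an ambiguous letter such as "c" as Roman.

```python
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral to an integer using the subtractive rule.
    No validation of well-formedness is attempted."""
    values = [ROMAN_VALUES[ch] for ch in numeral.lower()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (iv = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def normalize_number(n: str) -> str:
    """Return n as an Arabic numeral string where this sketch knows how;
    otherwise return it unchanged. (Hypothetical helper.)"""
    if n.isascii() and n.isdigit():
        return n
    if n and all(ch in ROMAN_VALUES for ch in n.lower()):
        # Assumption: ambiguous letters are always read as Roman here.
        return str(roman_to_int(n))
    return n
```

That a value such as "c" can be read in more than one way is one reason the interpretation of numerals may depend on information that is available only once the rest of the file has been resolved.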

The description above is necessarily generalized. For details consult the function library, particularly the functions/resolution directory. In cases of conflict between the code and the description above, the code should be given priority.

Expansion

The second transformation expands the resolved file. You must resolve a TAN file before you try to expand it. The goal behind expansion is to unpack the components of a resolved document and identify any errors along the way (see the master list of errors). There are three levels of expansion, corresponding to the three levels of Schematron validation: terse, normal, and verbose.

In terse expansion, for each value of an attribute, an element with the attribute's name is placed within the parent (e.g., @type="a b" produces <type>a</type> and <type>b</type>). If the value is an idref that points to an alias, a copy is made for the idref of each target vocabulary item. If an idref does not point to a vocabulary item of the expected type, an error message is also copied into the parent. Any values that are ranges are expanded, if need be. Select networked files are checked for basic validity. Class-2 files undergo extra rounds of processing during terse validation: sources are adjusted if need be, and then checked against references in the host class-2 file. (See the section called “Class 2 pointer syntax: referencing texts”.) In terse expansion, all pointing mechanisms are checked. Because of this basic requirement, terse expansion can take a long time on lengthy files, or ones with complex <adjustments>.
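
The first step of terse expansion, turning attribute values into child elements, can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustration only: expand_attributes is a hypothetical name, and the choices to keep the original attribute, to place the new elements before existing children, and to skip namespaced attributes are assumptions of the sketch rather than documented behavior.

```python
import xml.etree.ElementTree as ET


def expand_attributes(elem: ET.Element) -> None:
    """For each space-separated value of each attribute, insert a child
    element named after the attribute. Works in place, recursively."""
    original_children = list(elem)
    position = 0
    for name, value in list(elem.attrib.items()):
        if name.startswith("{"):
            # Namespaced attributes (e.g. xml:id) are skipped in this sketch.
            continue
        for token in value.split():
            child = ET.Element(name)
            child.text = token
            # Assumption: new elements go first, in attribute-value order.
            elem.insert(position, child)
            position += 1
    # Recurse only into children that were present before expansion.
    for child in original_children:
        expand_attributes(child)


root = ET.fromstring('<div type="a b"><div type="c"/></div>')
expand_attributes(root)
```

After the call, the outer div holds two type elements with the texts a and b, followed by the inner div, which holds its own type element.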

Normal expansion builds on terse expansion by interrogating networked files more closely. Any errors that were reported during the terse stage but were suppressed to avoid clutter are enabled.

Verbose expansion generally attends to procedures that are complex, or are not essential parts of a validation report. For example, a <model> of a class-1 file will be checked, to find references that one file has but the other lacks. A class-1 <redivision> will be analyzed, to make sure that the two transcriptions are identical. A catalog file in the same directory will be checked, to see if it has faulty entries.

Many errors lend themselves to solutions that can be recommended by the TAN function library. Some solutions are returned to the Schematron validation method as Schematron Quick Fixes (SQFs). XML editors that are equipped to handle SQFs (e.g., Oxygen XML Editor) can then prompt users to quickly fix an errant section. For example, if text has not been NFC Unicode-normalized, an SQF will allow a user to make the change in two clicks. Thus, TAN validation does not merely tell you what the problems are; it tries to help fix them.
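
The NFC example lends itself to a short Python sketch using the standard unicodedata module. It shows only the check and the proposed replacement text; how TAN packages such a suggestion as a Schematron Quick Fix is not shown, and the function name nfc_fix is hypothetical.

```python
import unicodedata
from typing import Optional


def nfc_fix(text: str) -> Optional[str]:
    """Return the NFC-normalized form of text if it differs from the
    input, or None if the text is already in NFC (nothing to fix)."""
    normalized = unicodedata.normalize("NFC", text)
    return None if normalized == text else normalized
```

For instance, the letter e followed by a combining acute accent (U+0065 U+0301) is not in NFC, so the function proposes the single precomposed character U+00E9 instead.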

The term "expansion" describes the process but possibly not the output. If the global parameter $tan:validation-mode-on is true, then in the course of expanding the file the TAN templates will abandon any parts that are no longer needed. The output is normally much smaller than the input file, restricted as it is to the root element, which merely wraps errors, warnings, or fixes. So although during validation the file is really being expanded, at the end only a small portion of the expanded file is returned to the Schematron processor, to expedite validation. But if $tan:validation-mode-on is false (the default value), the entire expanded file and its dependencies are returned. Such output can be very useful in applications.

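The preceding description of expansion is necessarily generalized. For details consult the function library, especially the functions/expansion directory. In cases of conflict between the code and the description above, the code should be given priority.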
