TAN Utilities

Body Builder

Location: utilities/Body%20Builder/Body%20Builder.xsl

Suppose you have texts, aspects of whose syntax, structure, or format correspond to TAN or TEI elements or markup. This application allows you to write regular-expression-based rules to convert that text into a TAN or TEI format. Input consists of one or more files in plain text, XML, or Word docx. The input is processed against each rule, in order of appearance, progressively structuring the text. Body Builder is intended for intermediate and advanced users who are comfortable with regular expressions and XML markup. The application is ideal for cases where complex, numerous, or lengthy documents need to be converted into TAN or TEI, as well as for developing workflows where live, ever-changing work needs to be regularly pushed into a TAN or TEI format.

Version 2021-07-13

This master stylesheet is the public interface for the application. The parameters you will most likely want to change are listed and documented below, to help you customize the application to suit your needs. If you are relatively new to XSLT, or TAN applications, see Using TAN Applications and Utilities in the TAN Guidelines for general instructions. If you want to avoid changing the master application file, use the accompanying configuration file. Or make a copy of this file and edit and run it directly. Or create and configure a transformation scenario in Oxygen, defining the relevant parameters as you like. If you are comfortable with XSLT, try creating your own stylesheet, then import this one, and customize the process. To access the code base, follow the link in the <xsl:include> at the bottom of this file.

Description

Primary input: a TAN-T or TAN-TEI file that represents a target template for the parsed content coming from the secondary input

Secondary input: one or more non-TAN files in plain text, XML, or Word format (docx); perhaps configuration files for the parameters

Primary output: the primary input, with its contents replaced by a tree parsed by applying rules to the secondary input

Secondary output: none

This application is intended to help users convert a text to TAN-T, TAN-TEI, or TAN-A. This is a difficult task, mainly because the source could be plain text, an XML file, or a Word document, so the conversion goes either from unstructured to structured text or from one type of structure to another. In a Word document, the formatting might mean something, or it might not. Structure might be embedded in the text, or in formatting, or both. Users tend to be inconsistent and incomplete, and the docx format has challenges not apparent to the user. The XML structure in a Word file might break up adjacent, identically formatted text, because it is preserving a record of editing history, or noting where the cursor was when the document was last edited. In sum, one should not take for granted the challenge of building a pipeline from pre-TAN/TEI files to TAN/TEI ones!

The "plain" text itself poses challenges. We assume that there are in the text various numerals or words that signal reference numbers. But there are thousands of ways an editor might choose to use those reference numbers. Some editors interleave into a single document multiple overlapping or competing reference systems. A TAN file allows only one primary tree, so only one of those reference systems can be used.

Body Builder handles these problems by allowing the editor to declare a sequence of patterns in the text that are the key to the textual hierarchy. To build that sequence of patterns properly, you must have a very good command of regular expressions. To get you started, some examples have been provided, based on actual conversions into TAN from challenging real-world documents.
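
The idea of an ordered sequence of patterns can be illustrated outside XSLT. The following Python sketch is not Body Builder's code or its parameter syntax: the rule list, the element and attribute names, and the sample text are all hypothetical. It only shows how each rule is applied, in order, to the result of the previous one.

```python
import re

# Hypothetical rules: (pattern, replacement). Order matters, because each
# rule is applied to the output of the rule before it.
RULES = [
    # "Chapter 3" alone on a line becomes a top-level division marker.
    (re.compile(r"^Chapter (\d+)\s*$", re.MULTILINE),
     r"<div type='chapter' n='\1'/>"),
    # A leading number plus period, e.g. "2. ", becomes a section marker.
    (re.compile(r"^(\d+)\.\s+", re.MULTILINE),
     r"<div type='section' n='\1'/>"),
]

def apply_rules(text, rules=RULES):
    """Apply every rule in sequence, progressively marking up the text."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

sample = "Chapter 1\n1. In the beginning.\n2. And then.\n"
marked = apply_rules(sample)
print(marked)
```

The markers here are empty milestone elements chosen for brevity; building a nested hierarchy from them would be a further step.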

This utility has been designed around select test cases, and there are no doubt many ways it could be developed and enhanced. If you encounter a problem, raise a ticket on GitHub.

Some assumptions:

If the catalyzing input file is not a TAN file, then a fallback, generic TAN file should be used; the specific one is determined by parameters below.
If the catalyzing input file is TAN-T, that means the output will be as well, and only the structured but plain text will be returned, because TAN-T does not have any internal markup.
If the catalyzing input file is TAN-TEI, the TAN-TEI output will be structured text, with select internal markup. To coordinate the features of your text with specific TEI markup may require testing with the parameters below.
If the catalyzing input file is TAN-A, no output is produced at the moment. When this feature is eventually supported, the output TAN-A file will contain structured annotations on the text. This option will be supported only for Word files, whose comments will be interpreted as TAN-A claims.

Some tips:

Build the parameters incrementally. You will find that two or three of the parameters below are a challenge to get right, especially for complicated documents. Begin with one or two components, test the output, then add more components.
If building up the components in $main-text-to-markup, start with the most general rule first, but put it at the end of the list. Incrementally add more specific rules, placing them before the more general ones.
If you find that the output doesn't match what you intended, try commenting out some of the elements in $main-text-to-markup.
Look out for problems in your source document. Sometimes this application results in erroneous output not because of the application, but because the input is not what you expected. In fact, if you are working with live documents that others are providing you, this application may help you identify inconsistencies and problems in that input.
If certain errors recur, you can plan for them. See the separate CLIO configuration file, which inserts the illegal element <unexpected> to signal a problem.

Nota bene:

Many input files will be full of internal inconsistency and error. Do not take results at face value. Scrutinize the output. Sometimes this will reveal that the problem originates with the input: typos, inconsistencies, bad formatting, etc. If you see errors in the input, you can either (1) fix the input or (2) customize this application to make those changes during processing. Option 2 is definitely to be preferred if the source text is a live, working document that you have little control over, and there is even the slightest chance it might be revised, and need to be processed again.
This application works well with a TAN file that points to the source file in question, via <source> or <predecessor>. As that source file gets updated, the TAN file can be re-processed through this application, to refresh the results.
Currently, this application focuses on only select Word docx components: the main text, comments, deletions, and insertions. No support is yet provided for the header, footer, footnotes, or endnotes.
This application was developed in tandem with two sets of actual workflows, whose results have been documented in the files in the config subdirectory. No doubt other concrete examples will cause this application to grow and change, or bring out bugs. Feel free to register problems or feature requests via GitHub.

Warning: certain features have yet to be implemented

Anchor comments to gaps between characters, so they are not lost when the anchored text is lost.
Support HTML input
Support ODT input
Let the default template be a document with the root element body.
Support parsing of docx endnotes and footnotes.
Demonstrate how to convert a raw index to TAN-A.

Body Remodeler

Location: utilities/Body%20Remodeler/Body%20Remodeler.xsl

Suppose you have a text in a well-structured TAN-T file, and you want to use it to model the structure of another version of that same work. This application will take the input, and infuse the text into the structure of the model, using the proportionate lengths of the model's text as a guide for where to break the new text. Any two versions of a single work, particularly translations and paraphrases, rarely correlate evenly. A translator may begin a work being relatively verbose, and become more economical in later parts. Such uneven correlation means that one-to-one modeling is not a good strategy for aligning the new text. Rather, one should start with the topmost structures and work progressively toward the smallest levels. Body Remodeler supports such an incremental approach, and allows you to restrict the remodeling activity to certain parts of a text. When used in tandem with the TAN editing tools for Oxygen, which allow you to push and pull words, clauses, and sentences from one leaf div to another, you will find that Body Remodeler can save you hours of editorial work.

Version 2021-07-13

This master stylesheet is the public interface for the application. The parameters you will most likely want to change are listed and documented below, to help you customize the application to suit your needs. If you are relatively new to XSLT, or TAN applications, see Using TAN Applications and Utilities in the TAN Guidelines for general instructions. If you want to avoid changing the master application file, use the accompanying configuration file. Or make a copy of this file and edit and run it directly. Or create and configure a transformation scenario in Oxygen, defining the relevant parameters as you like. If you are comfortable with XSLT, try creating your own stylesheet, then import this one, and customize the process. To access the code base, follow the link in the <xsl:include> at the bottom of this file.

Description

Primary input: preferably a TAN-T or TAN-TEI file

Secondary input: a TAN-T or TAN-TEI file that has the model div structure and reference system

Primary output: the model, with its div structure intact, but the text replaced with the text of the input, allocated to the new div structure proportionate to the model's text length

Secondary output: none
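
Proportionate allocation can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the application's algorithm: the function name is hypothetical, the model is reduced to a flat list of leaf-div strings, model lengths are counted in characters, and the new text is broken only at word boundaries.

```python
def allocate_proportionally(model_divs, new_text):
    """Split new_text into one chunk per model div, sized in proportion
    to the character length of each model div's text."""
    words = new_text.split()
    total = sum(len(div) for div in model_divs)
    if total == 0:
        raise ValueError("the model has no text to measure against")
    chunks, start, running = [], 0, 0
    for div in model_divs:
        running += len(div)
        # Map the cumulative share of the model onto a word position.
        end = round(len(words) * running / total)
        chunks.append(" ".join(words[start:end]))
        start = end
    return chunks

model = ["aaaa aaaa", "bbbb", "cccc cccc cccc"]
new = "one two three four five six seven eight nine"
print(allocate_proportionally(model, new))
```

Because the breaks come from proportions alone, they will often fall in the wrong place, which is why the output is meant to be corrected by hand, level by level.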

Nota bene:

If the catalyzing input file is not a class-1 file, but just an XML file, it will be read for its string value. The output will be a copy of the model with the string proportionately allocated to its body components.
If you remodel a set of sibling leaf divs but exclude certain intervening leaf divs from being remodeled, the entire remodel will be placed at the location of the first leaf div only. That is, that area of the remodel will be consolidated, and the text will no longer reflect the original order.
Because this application produces TAN output, metadata will be supplied to the output, along with a change entry, crediting/blaming the application.
Comparison is made with the model on the basis of resolved, not expanded, class 1 files, and any matches involving @n or @n-built references will be on the basis of resolved numerals.
Although the model can be a TAN-TEI file, refining the output will not be possible using the TAN Oxygen editor tools, because pushing a word, clause, or sentence from one leaf div to another will inevitably require splitting and rejoining the host elements. Such a utility is possible, but would require resources for development.

Warning: certain features have yet to be implemented

Support the complete-the-square method (model has a redivision that matches the input's div structure)
Test, troubleshoot against various TEI models

Strategies for use

Method: gentle increments

Use this method in tandem with the TAN editing tools in Oxygen, where you can easily push and pull entire words, clauses, and sentences from one leaf div to another. When you are editing (steps 2 and 5), place the model in a parallel window.

1. Run plain text against the model.
2. Edit the output, focusing only on getting the top-level divisions correct.
3. Change the parameter $preserve-matching-ref-structures-up-to-what-level to 1.
4. Run the edited input against the model again. Your top-level divisions should remain intact.
5. Edit the output, focusing only on getting the 2nd-level divisions correct.
6. Repeat steps 3-5 through the rest of the hierarchy, raising the parameter by one level each time.

Working with non-XML input: You might have text from some non-XML source that you want to feed into this method. If you can get down to the plain text, put it into any XML file, and run it through this application, changing the parameter $model-uri-relative-to-catalyzing-input to specify exactly where the model is. You'll get the model with the text infused. It will need a lot of metadata editing, but at least you'll have a good start for structuring the body.

Body Sync

Location: utilities/Body%20Sync/Body%20Sync.xsl

Version 2021-07-07

This master stylesheet is the public interface for the application. The parameters you will most likely want to change are listed and documented below, to help you customize the application to suit your needs. If you are relatively new to XSLT, or TAN applications, see Using TAN Applications and Utilities in the TAN Guidelines for general instructions. If you want to avoid changing the master application file, use the accompanying configuration file. Or make a copy of this file and edit and run it directly. Or create and configure a transformation scenario in Oxygen, defining the relevant parameters as you like. If you are comfortable with XSLT, try creating your own stylesheet, then import this one, and customize the process. To access the code base, follow the link in the <xsl:include> at the bottom of this file.

Description

Primary input: a class 1 file with a redivision element in the head

Secondary input: the redivision

Primary output: the primary input, with the text of its body revised to match the text in the chosen redivision

Secondary output: none

Nota bene:

The comparison can be made only on the basis of space-normalized comparisons, which means that the output will have leaf divs without any internal indentation.
If there are any special end-of-div characters to insert, they will be rendered as hexadecimal codepoint entities.
Comments and processing instructions inside the body will be retained. If you choose to mark alterations, make sure there aren't already some in your file, otherwise it will all get mixed up.
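
A space-normalized comparison can be sketched as follows. This Python snippet is only an illustration, similar in spirit to XPath's normalize-space(); the function name and the sample strings are hypothetical.

```python
import re

def normalize_space(text):
    """Collapse runs of space, tab, CR, and LF to a single space and trim."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")

indented = "In the beginning\n      was the Word."
flat = "In the   beginning was the Word."

# The two strings differ only in whitespace, so they compare as equal.
same = normalize_space(indented) == normalize_space(flat)
print(same)
```

Since indentation is discarded before the comparison, it cannot be restored afterward, which is why the output's leaf divs carry none.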

Catalog Creator

Location: utilities/Catalog%20Creator/Catalog%20Creator.xsl

Version 2021-07-07

This master stylesheet is the public interface for the application. The parameters you will most likely want to change are listed and documented below, to help you customize the application to suit your needs. If you are relatively new to XSLT, or TAN applications, see Using TAN Applications and Utilities in the TAN Guidelines for general instructions. If you want to avoid changing the master application file, use the accompanying configuration file. Or make a copy of this file and edit and run it directly. Or create and configure a transformation scenario in Oxygen, defining the relevant parameters as you like. If you are comfortable with XSLT, try creating your own stylesheet, then import this one, and customize the process. To access the code base, follow the link in the <xsl:include> at the bottom of this file.

Description

Primary input: any XML file

Secondary input: none

Primary output: perhaps diagnostics

Secondary output: a new catalog file for select files in the input file's directory, and perhaps subdirectories; if the collection is TAN-only, the filename will be catalog.tan.xml, otherwise it will be catalog.xml

Every catalog file is an XML file with a root element <collection> that has <doc> child elements. Both <collection> and <doc> are in no namespace. A <doc> may contain anything; its content is arbitrary.

Nota bene:

Files with the name catalog.tan.xml and catalog.xml will be ignored.

Only files available as an XML document will be catalogued.
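
As a rough picture of the catalog structure, the following Python sketch builds a <collection> of <doc> elements in no namespace and skips existing catalog files. It is not the application's code: the function name is hypothetical, and the href attribute on <doc> is an assumption made for illustration, since the text above does not specify how each <doc> identifies its file.

```python
import xml.etree.ElementTree as ET

# Existing catalog files are ignored, per the note above.
IGNORED = {"catalog.tan.xml", "catalog.xml"}

def build_catalog(filenames):
    """Return a <collection> element with one <doc> per catalogued file."""
    collection = ET.Element("collection")
    for name in sorted(filenames):
        if name in IGNORED:
            continue
        # Assumption: each <doc> points to its file through @href.
        ET.SubElement(collection, "doc", {"href": name})
    return collection

catalog = build_catalog(["b.xml", "catalog.xml", "a.xml"])
print(ET.tostring(catalog, encoding="unicode"))
```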

File Copier

Location: utilities/File%20Copier/File%20Copier.xsl

Version 2021-07-07

This master stylesheet is the public interface for the application. The parameters you will most likely want to change are listed and documented below, to help you customize the application to suit your needs. If you are relatively new to XSLT, or TAN applications, see Using TAN Applications and Utilities in the TAN Guidelines for general instructions. If you want to avoid changing the master application file, use the accompanying configuration file. Or make a copy of this file and edit and run it directly. Or create and configure a transformation scenario in Oxygen, defining the relevant parameters as you like. If you are comfortable with XSLT, try creating your own stylesheet, then import this one, and customize the process. To access the code base, follow the link in the <xsl:include> at the bottom of this file.

Description

Primary input: any XML file

Secondary input: none (but see parameters)

Primary output: none

Secondary output: the file copied to the target location, but with all relative @hrefs revised in light of the target location

Nota bene:

Links are based on common constructs. Resolution of @href is applied everywhere, but resolution of @src is applied only in HTML files.
Processing instructions will be parsed for values assigned to any href pseudo-attribute.
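
Revising a relative link in light of a new location amounts to resolving it against the old location and re-expressing it relative to the new one. The Python sketch below illustrates that idea only; the function name and the paths are hypothetical, and POSIX-style paths are assumed.

```python
import posixpath

def rebase_href(href, source_file, target_file):
    """Rewrite a relative href from source_file so that it points to the
    same resource after the file has been copied to target_file."""
    if "://" in href or href.startswith(("/", "#")):
        return href  # absolute and fragment-only links are left alone
    source_dir = posixpath.dirname(source_file)
    target_dir = posixpath.dirname(target_file)
    # Resolve against the old location, then express relative to the new.
    resolved = posixpath.normpath(posixpath.join(source_dir, href))
    return posixpath.relpath(resolved, target_dir)

new_href = rebase_href("../img/a.png",
                       "/proj/docs/file.xml",
                       "/proj/out/deep/file.xml")
print(new_href)  # ../../img/a.png
```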

TAN-A-lm Builder

Location: utilities/TAN-A-lm%20Builder/TAN-A-lm%20Builder.xsl

Well-curated lexico-morphological data is highly valuable for a variety of applications such as quotation detection, stylometric analysis, and machine translation. This application will process any TAN-T or TAN-TEI file through existing TAN-A-lm language libraries, and online search services, looking for the best lexico-morphological profiles for the file's tokens.

Version 2021-09-06

This master stylesheet is the public interface for the application. The parameters you will most likely want to change are listed and documented below, to help you customize the application to suit your needs. If you are relatively new to XSLT, or TAN applications, see Using TAN Applications and Utilities in the TAN Guidelines for general instructions. If you want to avoid changing the master application file, use the accompanying configuration file. Or make a copy of this file and edit and run it directly. Or create and configure a transformation scenario in Oxygen, defining the relevant parameters as you like. If you are comfortable with XSLT, try creating your own stylesheet, then import this one, and customize the process. To access the code base, follow the link in the <xsl:include> at the bottom of this file.

Description

Primary input: a class 1 file

Secondary input: a TAN-A-lm template; language catalogs; perhaps language search services

Primary output: a new TAN-A-lm file freshly populated with lexicomorphological data, sorted with unmatched tokens at the top, followed by ambiguous ones, followed by non-ambiguous ones

Secondary output: none

Optimization strategies adopted:

Minimize the number of times files in the language catalog must be consulted and resolved
A hit on @val in a local TAN-A-lm file precludes any follow-up searches based on @rgx or online search services
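
The short-circuiting lookup and the sort order of the output can be sketched in Python. Everything here is hypothetical and only stands in for the real data: the two lookup tables imitate @val and @rgx entries in a local library, and the analysis strings are invented placeholders. Online search services are left out.

```python
import re

# Hypothetical stand-ins for exact-value and regex entries in a library.
VAL_ENTRIES = {"dogs": ["dog:noun.pl"],
               "saw": ["see:verb.past", "saw:noun.sg"]}
RGX_ENTRIES = [(re.compile(r".+ing"), ["?:verb.participle"])]

def analyses_for(token):
    """Return candidate analyses; an exact hit precludes the regex pass."""
    if token in VAL_ENTRIES:
        return VAL_ENTRIES[token]
    for pattern, analyses in RGX_ENTRIES:
        if pattern.fullmatch(token):
            return analyses
    return []

def rank(analyses):
    """0 = unmatched, 1 = ambiguous, 2 = unambiguous."""
    if not analyses:
        return 0
    return 1 if len(analyses) > 1 else 2

tokens = ["dogs", "saw", "zzz", "running"]
profiles = sorted(((t, analyses_for(t)) for t in tokens),
                  key=lambda pair: rank(pair[1]))
for token, analyses in profiles:
    print(token, analyses)
```

Sorting this way puts the tokens that most need an editor's attention at the top of the file.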

Nota bene:

There must be access to a language catalog, i.e., a collection of TAN-A-lm files that are language specific.
The TAN-A-lm template is relied upon to dictate the settings for the output file, e.g., tokenization pattern, TAN-mor morphology, etc.
We assume that a search for lexico-morphological data will entail a lot of different TAN-A-lm files with a number of conventions. Codes found in language catalogs must be converted to TAN-standardized feature names, and then reconverted into the codeset of choice, dictated by the <morphology> in the template TAN-A-lm file.

Warning: certain features have yet to be implemented

What if the @xml:lang of the input doesn't match TAN-mor or language catalog files?
What if a morphology has @which? Will it still work?
Ensure the responsible repopulation of the metadata of the template
Support false value for $retain-morphological-codes-as-is

TAN-A-lm Calibrator

Location: utilities/TAN-A-lm%20Calibrator/TAN-A-lm%20Calibrator.xsl

This application is useful when editing TAN-A-lm files. Very frequently, when using local language resources to generate a fresh TAN-A-lm file for a class-1 file, the results are very dirty. Cleaning up the file normally involves deleting many entries, so that the certainty rates of the remaining alternatives no longer add up to 1.0. Or perhaps certainty has not even been set, and it needs to be added. This application will refresh the certainty rates of a TAN-A-lm file, making it more useful for applications that rely on certainty rates for scoring, such as Tangram. A second use is for edits to a language-specific TAN-A-lm file, where you might be recalibrating the certainty values of some lm combinations. Perhaps a word form has ten lexicomorphological resolutions, each one with a detailed @cert value. You want to promote one of the options as being slightly more probable, but you do not want to recalculate all the values by hand so that they add up to 1.0. You can increase or decrease the @cert value of an option, then run the file through this application to recalibrate all entries so they add up to 1.0 certainty.
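
The arithmetic behind recalibration can be sketched in a few lines of Python. This is a minimal illustration that assumes simple proportional scaling; the function name is hypothetical, and the application's actual handling of missing @cert values is not modeled.

```python
def recalibrate(certs):
    """Scale certainty values proportionally so that they sum to 1.0."""
    total = sum(certs)
    if total <= 0:
        raise ValueError("certainty values must sum to a positive number")
    return [c / total for c in certs]

# Three options once shared 0.5, 0.3, and 0.2; the last was deleted.
before = [0.5, 0.3]
after = recalibrate(before)
print([round(c, 3) for c in after])  # [0.625, 0.375]
```

The relative weight of the surviving options is unchanged: the first is still five-thirds as likely as the second.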

Version 2021-07-07

This master stylesheet is the public interface for the application. The parameters you will most likely want to change are listed and documented below, to help you customize the application to suit your needs. If you are relatively new to XSLT, or TAN applications, see Using TAN Applications and Utilities in the TAN Guidelines for general instructions. If you want to avoid changing the master application file, use the accompanying configuration file. Or make a copy of this file and edit and run it directly. Or create and configure a transformation scenario in Oxygen, defining the relevant parameters as you like. If you are comfortable with XSLT, try creating your own stylesheet, then import this one, and customize the process. To access the code base, follow the link in the <xsl:include> at the bottom of this file.

Description

Primary input: any TAN-A-lm file

Secondary input: none

Primary output: the TAN-A-lm file with certainty recalibrated

Secondary output: none

Warning: certain features have yet to be implemented

Look at ways to adjust tok certainty

Nota bene:

Input is not resolved ahead of time, so inclusions are ignored.
Calibration is not applied to <tok>, only to <lm>s within any <ana>. The certainty of <tok> is difficult to calibrate because of the complexities involved in @ref, @rgx, and @chars. A future version of this application may support that feature.

Updater

Location: utilities/Updater/Updater.xsl

This master stylesheet is the public interface for the application. The parameters you will most likely want to change are listed and documented below, to help you customize the application to suit your needs. If you are relatively new to XSLT, or TAN applications, see Using TAN Applications and Utilities in the TAN Guidelines for general instructions. If you want to avoid changing the master application file, use the accompanying configuration file. Or make a copy of this file and edit and run it directly. Or create and configure a transformation scenario in Oxygen, defining the relevant parameters as you like. If you are comfortable with XSLT, try creating your own stylesheet, then import this one, and customize the process. To access the code base, follow the link in the <xsl:include> at the bottom of this file.

Version 2021-07-07

Description

Primary input: any TAN file of version 2020

Secondary input: none

Primary output: the TAN file converted to the latest version

Secondary output: none

Nota bene:

To convert TAN files from a version earlier than 2020, use applications released with prior alpha versions.
