Creating and populating TAN files


TAN is a representational format. Every TAN file models some source.

If those sources are non-digital, it is a relatively straightforward task to create and populate a TAN file. You just start editing everything by hand. In some cases, you might get a head start with an algorithm. For example, optical character recognition (OCR) on an edition might give you a dirty but useful start for a TAN-T file. Applying OCR to a printed index of quotations might get you the basic start to a TAN-A file. Despite the computer's assistance, the majority of the task will be spent in correcting any conversions. Thoughtful, scholarly attention is critical to making these files suitable for use.

In many other cases, you are trying to take something that already exists digitally and convert it into a TAN format. If you find a Word file, a web page, or a plain text file that can serve as the basis for a TAN file, a common first impulse is to copy the desired content, paste it into the body of an empty TAN file, then manually correct the material. That solution is quick and easy, but short-sighted. You may find that you made a major mistake, and that you have done so much work that you cannot backtrack. Perhaps you accidentally deleted all punctuation. Or you eliminated line breaks that, as you did not realize at the time, were useful signals about where <div>s should be separated. Even if all goes well, after all that hard work you might find out that the pre-TAN data sources you started out with have been updated, with other things corrected. If any significant time has elapsed, you may have forgotten what procedure you followed to convert the data. And even if you remember, you will have to repeat the steps, and dread the day when those pre-TAN sources are updated yet again.

In these cases, it is better to think not about fixing the files, but about developing a system to fix the files. Your goal is to create a digital pipeline or workflow that can be applied whenever needed, so that changes to those pre-TAN versions can be channeled into your TAN library. If you or a project member has experience in XSLT, it is a good idea to develop stylesheets to convert the data to TAN. When you find mistakes such as those described above, no harm is done. You can simply adjust and re-run your process, each time getting better results. An XSLT-based approach requires extra work initially. Establishing a stable transformation process can be time-consuming, since it requires repeated sequences of trial, error, and diagnosis. But the investment pays off in the long run, especially if you are dealing with dozens, hundreds, or thousands of files. The routines you write for one set of files might be useful for the next.

Here is one approach. Create a template skeleton TAN file that resembles your desired output. Develop an XSLT stylesheet that does the following:

1. Fetches the pre-TAN file (main input).
2. Puts the main input in an XML tree, then applies select alterations.
3. Fetches the template TAN file.
4. Pushes the altered pre-TAN content into the template file.
5. Saves the infused template, either as the primary output, or as a result document.
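
The steps above can be sketched as follows for the simple case where the pre-TAN file is already XML (here, an XHTML page) and so can serve directly as the input to the transformation. This is a minimal illustration, not a stylesheet taken from the TAN library. The file name template.tan-t.xml, the mode names, and the choice to turn each non-empty paragraph into a <div> with type="p" are all assumptions; the type value would need to correspond to a div type declared in the head of your own template.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tan="tag:textalign.net,2015:ns"
    xmlns:html="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="#all">

    <!-- Hypothetical location of the skeleton TAN-T file, relative to this stylesheet -->
    <xsl:param name="template-uri" select="'template.tan-t.xml'"/>

    <!-- Step 1: the pre-TAN file is the input document of the transformation -->
    <xsl:variable name="main-input" select="/"/>

    <!-- Step 3: fetch the template TAN file -->
    <xsl:variable name="template"
        select="doc(resolve-uri($template-uri, static-base-uri()))"/>

    <!-- In mode "infuse", anything without a more specific template is copied as-is -->
    <xsl:mode name="infuse" on-no-match="shallow-copy"/>

    <!-- Step 5: the infused template becomes the primary output -->
    <xsl:template match="/">
        <xsl:apply-templates select="$template" mode="infuse"/>
    </xsl:template>

    <!-- Step 4: replace the children of the template's body with converted content -->
    <xsl:template match="tan:body" mode="infuse">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates mode="to-tan"
                select="$main-input//html:p[normalize-space()]"/>
        </xsl:copy>
    </xsl:template>

    <!-- Step 2: the alterations; each non-empty paragraph becomes a numbered div -->
    <xsl:template match="html:p" mode="to-tan">
        <div xmlns="tag:textalign.net,2015:ns" type="p" n="{position()}">
            <xsl:value-of select="normalize-space(.)"/>
        </div>
    </xsl:template>

</xsl:stylesheet>
```

To save the result as a result document instead of the primary output, the xsl:apply-templates instruction in the root template could be wrapped in an xsl:result-document element whose href attribute names the target file. Because every alteration lives in the to-tan templates, a mistake such as lost punctuation is fixed by editing one template and re-running the stylesheet.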

One of the challenges to this method is that the pre-TAN input might not be XML, in which case it cannot be the initial, catalyzing input to the XSLT file. But that is fine. For such conversions, you can make your XSLT file a MIRU (main input resolved URIs) stylesheet. A MIRU stylesheet accepts any XML file, including itself, as its catalyzing input. That catalyzing input is unimportant, because a MIRU stylesheet fetches the main input itself, through global parameters and variables that point to resolved URIs. For an example of a MIRU stylesheet, see applications/compare/compare TAN class 1 files.xsl.
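
The MIRU idea can be sketched as follows for a plain text source. This is an illustration under stated assumptions, not the approach taken in the application cited above: the parameter and variable names, the file paths, and the rule that each non-empty line becomes a <div> with type="line" are all hypothetical. The point to notice is that the root template never looks at the catalyzing input; everything comes from URIs resolved against the stylesheet's own location.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tan="tag:textalign.net,2015:ns"
    exclude-result-prefixes="#all">

    <!-- Hypothetical parameters: locations relative to this stylesheet -->
    <xsl:param name="main-input-relative-uri" select="'source/pre-tan.txt'"/>
    <xsl:param name="template-relative-uri" select="'template.tan-t.xml'"/>

    <!-- Resolve both locations against the stylesheet, not the catalyzing input -->
    <xsl:variable name="main-input-resolved-uri"
        select="resolve-uri($main-input-relative-uri, static-base-uri())"/>
    <xsl:variable name="template-resolved-uri"
        select="resolve-uri($template-relative-uri, static-base-uri())"/>

    <!-- The real main input: the plain text file, read as a sequence of lines -->
    <xsl:variable name="main-input-lines"
        select="unparsed-text-lines($main-input-resolved-uri)"/>
    <xsl:variable name="template" select="doc($template-resolved-uri)"/>

    <!-- In mode "infuse", anything without a more specific template is copied as-is -->
    <xsl:mode name="infuse" on-no-match="shallow-copy"/>

    <!-- The catalyzing input only triggers the process; its content is ignored -->
    <xsl:template match="/">
        <xsl:apply-templates select="$template" mode="infuse"/>
    </xsl:template>

    <!-- Fill the template's body: one div per non-empty line of the text file -->
    <xsl:template match="tan:body" mode="infuse">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:for-each select="$main-input-lines[normalize-space()]">
                <div xmlns="tag:textalign.net,2015:ns" type="line" n="{position()}">
                    <xsl:value-of select="normalize-space(.)"/>
                </div>
            </xsl:for-each>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
```

A stylesheet like this can be run with itself as the source document. Reading the file line by line keeps the original line breaks available as signals for where <div>s should be separated, which is exactly the information that a hasty copy-and-paste tends to destroy.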

The method described above has been used successfully to handle several different kinds of conversion, including ones where the source files are updated very frequently. In such scenarios, the traditional cut-paste-and-edit method is not only unproductive; it is foolish.

Writing transformations can be laborious at first. Finding the best way to handle and manipulate a pre-TAN file is an intellectual challenge with multiple solutions. But there is a good chance that some of the labor you have in mind has already been done for you in a TAN function (see Chapter 11, TAN variables, keys, functions, and templates) or application (see the subdirectory applications).