Creating and populating TAN files

TAN is a representational format. Every TAN file models some source.

If the source is non-digital, creating and populating a TAN file is relatively straightforward: you edit everything by hand. In some cases, you might get a head start from a rough automated process. For example, optical character recognition (OCR) on an edition might give you a dirty but useful start for a TAN-T file. Or OCR on an index might get you the outlines of a TAN-A-div file that indexes all quotations. Even with the computer's assistance, most of the task consists of converting non-digital claims into digital ones, and manual effort remains central.

In many other cases, you are trying to take something that already exists digitally and convert it into the TAN format. In these cases, it is advisable to think of the problem computationally and to resist the urge to edit anything manually.

Suppose you find a Word file, a web page, or plain text that can serve as the basis for a TAN file. A common first impulse is to copy the desired content, paste it into the body of your TAN file, and then begin to correct and change things manually. You may later find that you made a major mistake that cannot, at that point, be undone. Perhaps you accidentally deleted all punctuation. Or you eliminated line breaks that were useful signals about where <div>s should be separated. Even if all goes well, after all that hard work you might find out that the pre-TAN data source has been updated, with errors corrected. If any significant time has elapsed, you may have forgotten what procedure you followed to convert the data. And even if you remember, you have to repeat the steps, and plan for the next time the pre-TAN source is updated. Otherwise you find yourself making piecemeal corrections.

For all these reasons, it is recommended that you set up an XSLT-based workflow to convert the data to TAN. When you find mistakes such as those described above, no harm is done. You can adjust your algorithm and re-run the process as often as you need, getting better results each time. This approach requires extra initial work: you will need to get to know XSLT (or an alternative) well, and establishing a good transformation process can be time-consuming. But the investment pays off in the long run. The routines you write for one set of files might save you some work on the next.

Under this method, you should begin the process by creating a template TAN file that resembles, even if skeletally, your desired output. You then write XSLT-based rules that (1) make alterations to the input, (2) infuse the altered input into the template, then (3) save the new file. This method has been used successfully to handle several different kinds of conversion, including ones where the source files are updated very frequently. In such cases, the traditional cut-paste-and-edit method is not only unproductive; it is foolish.
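The three steps above can be sketched in a few lines. The example below is only an illustration of the workflow's shape, written in Python as a stand-in for XSLT so that it runs without an XSLT processor. The function names, the skeletal template, the placeholder id, and the div type "line" are all hypothetical, and the namespace URI is assumed rather than taken from this section. The output is not a valid TAN-T file, because a real template needs a complete head.

```python
import xml.etree.ElementTree as ET

# Assumed TAN namespace; verify it against the schemas you validate with.
TAN_NS = "tag:textalign.net,2015:ns"
ET.register_namespace("", TAN_NS)

# Hypothetical skeletal template. A real one would carry a complete <head>.
TEMPLATE = f"""<TAN-T xmlns="{TAN_NS}" id="tag:example.com,2024:placeholder">
  <head/>
  <body/>
</TAN-T>"""


def alter_input(raw: str) -> list[str]:
    """Step 1: alter the input. Here every non-blank line becomes one unit,
    with internal whitespace normalized. Line breaks are kept as div signals."""
    return [" ".join(line.split()) for line in raw.splitlines() if line.strip()]


def infuse(template_xml: str, units: list[str]) -> ET.Element:
    """Step 2: infuse the altered input into the template's <body>."""
    root = ET.fromstring(template_xml)
    body = root.find(f"{{{TAN_NS}}}body")
    for n, text in enumerate(units, start=1):
        # The type value "line" is illustrative only.
        div = ET.SubElement(body, f"{{{TAN_NS}}}div", {"type": "line", "n": str(n)})
        div.text = text
    return root


def convert(raw: str, template_xml: str, out_path: str) -> None:
    """Step 3: save the new file. Re-run whenever the source or the rules change."""
    root = infuse(template_xml, alter_input(raw))
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)
```

Because the source file is never edited, a mistake in `alter_input` (say, stripping punctuation that should have been kept) is fixed by changing the rule and calling `convert` again on the same or an updated source.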

Writing transformations may seem laborious at first, because it is difficult to think through how best to handle and manipulate a TAN file. But there is a good chance that the labor you have in mind has already been done for you in the built-in TAN functions (see Chapter 11, TAN variables, keys, functions, and templates). See also the files provided under the subdirectory /do things.