Chapter 10. Best Practices in Working with TAN Files


In this chapter we discuss ways to manage, create, edit, and share TAN files. The material discussed here is non-normative. That is, these are suggestions based upon the experience of TAN users.

Local Setup

TAN files may be set up in any kind of structure one wishes, but because those files are meant to be shared and interlinked, it is beneficial to use similar local conventions, so that relative URLs remain intact from one person's system to another. It is especially important that collections be able to "talk" to each other via local URLs in @href, so it is a good idea to name collection subdirectories as predictably as possible.

Below is one way to organize the subdirectories of a typical setup for local TAN work:

library-[abbreviated name of creator 1]
- [abbreviated name of collection 1]—TAN-T(EI) files here
  - TAN-A-div (for TAN-A-div files)
  - TAN-A-tok (for TAN-A-tok files)
  - [etc.]
- [abbreviated name of collection 2]
- [etc.]
library-[abbreviated name of creator 2]
output—saved results from transformations, tests
pre-TAN—third-party files to be used to populate TAN files, or to be converted into them
TAN-2018—the core TAN files, downloaded from the website or the Git repository
stylesheets—stylesheets you have created
tools—third-party tools
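
The layout above can be created by hand, but a short script keeps it uniform across machines. The following is a minimal sketch, not part of TAN itself: the function name `scaffold_tan_workspace`, the `libraries` mapping, and the choice to pre-create only the TAN-A-div and TAN-A-tok subdirectories are assumptions made for illustration.

```python
from pathlib import Path

# Top-level working directories taken from the layout above.
SHARED_DIRS = ("output", "pre-TAN", "TAN-2018", "stylesheets", "tools")
# Class 2 subdirectories created inside every collection (extend as needed).
CLASS_2_DIRS = ("TAN-A-div", "TAN-A-tok")


def scaffold_tan_workspace(root, libraries):
    """Create the directory skeleton under root.

    libraries maps a creator abbreviation to a list of collection
    abbreviations, e.g. {"jk": ["bible"]}. Returns the directories
    as sorted POSIX-style paths relative to root.
    """
    root = Path(root)
    targets = [root / name for name in SHARED_DIRS]
    for creator, collections in libraries.items():
        library = root / f"library-{creator}"
        targets.append(library)
        for collection in collections:
            targets.append(library / collection)
            targets.extend(library / collection / sub for sub in CLASS_2_DIRS)
    for target in targets:
        # exist_ok makes the script safe to re-run on an existing setup.
        target.mkdir(parents=True, exist_ok=True)
    return sorted(t.relative_to(root).as_posix() for t in targets)
```

Calling `scaffold_tan_workspace("tan-work", {"jk": ["bible"]})` would create, among others, `tan-work/library-jk/bible/TAN-A-tok`; the creator and collection abbreviations here are placeholders.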

Under this approach, you create a library subdirectory for each provider or creator (including one for yourself). For any TAN corpus you publish, you should advise what name should be used for the library subdirectory. Likewise, for any TAN corpus you download, you should use the library name suggested by the provider.

Any time you create or download a collection of TAN files, save them in a subdirectory within the creator's library subdirectory. Once again, if you publish the collection, recommend a name for its subdirectory; if you download one, use the name the provider recommends.

If you use Git, it is advisable to make each collection its own Git repository. If you use GitHub, it is advisable to use your username for the library subdirectory.

This two-step approach to subdirectories anticipates cases where different people will want to encode the same body of texts, particularly heavily quoted collections that will commonly be given very brief, descriptive names, e.g., bible, quran.

When you name class 1 files (the filename, not the IRI name; see the section called “@id and a TAN file's IRI Name”), it is a good idea to start with an acronym or abbreviation for the work, followed by the language code, the editor's last name, and the date when the source scriptum was created or published. If a work lends itself to multiple reference schemes, you may need to include the scheme in the filename. Some examples:

ar.cat.grc.1949.minio-paluello-sem.xml (Aristotle's Categories, in Greek, 1949, edition by Minio Paluello, following a reference system based on semantic units [paragraphs, sentences, independent clauses]).
apocr.eng.kjv.1760.xml (apocrypha, English, King James Version, 1760 edition)
tlg0059.tlg031.perseus-grc1-Pl.Ti.xml (Plato's Timaeus in Greek)
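
If you name many class 1 files, composing the filename from its parts helps keep the pattern consistent. The sketch below is one possible helper, not a TAN convention: the function `class_1_filename` and its parameters are hypothetical, and the ordering of parts simply reproduces the first example above (work, language, date, then editor and reference scheme joined by a hyphen). The other examples follow different patterns, so adapt it to your collection.

```python
def class_1_filename(work, language, date, editor=None, scheme=None):
    """Build a class 1 filename such as ar.cat.grc.1949.minio-paluello-sem.xml.

    work     -- abbreviation for the work, e.g. "ar.cat"
    language -- language code, e.g. "grc"
    date     -- year the source scriptum was created or published
    editor   -- editor's last name (optional)
    scheme   -- label for the reference scheme (optional)
    """
    parts = [work, language, str(date)]
    # Editor and scheme share one dot-separated slot, joined by a hyphen.
    tail = "-".join(p for p in (editor, scheme) if p)
    if tail:
        parts.append(tail)
    return ".".join(parts) + ".xml"


print(class_1_filename("ar.cat", "grc", 1949, "minio-paluello", "sem"))
# prints ar.cat.grc.1949.minio-paluello-sem.xml
```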

Class 2 files are tougher. Because they bring two or more files or concepts together, filenames could become very long or unpredictably structured. At this time, the best recommendation is to put each class 2 file into a subdirectory, separate from class 1 files, and to give it a brief but meaningful name that points to the research question that motivated its creation. Some examples:

ar.cat.grc.1949.minio-paluello-sem-TAN-LM-sample.xml (lexico-morphology for Aristotle's Categories, in Greek)
nt.grc-syr.selections.TAN-A-tok.xml (word-for-word correspondences between the Syriac and Greek New Testaments)
plato.TAN-A-div.xml

Class 3 files are a bit easier. It is recommended that TAN-mor filenames begin with the language code, followed by an acronym for the person or group responsible for creating the features. TAN-key files are generally written to serve a specific project or collection, so the collection name and the TAN type should suffice. Examples:

ar.cat.TAN-key.xml
eng.kalvesmaki.com,2014.1.xml (tagging scheme #1 for English)

If you have a local copy of someone else's TAN collection, and you wish to create TAN files that depend on it, you will in all likelihood use relative URLs that point to the copies stored on your local drive. It is recommended that you also include an absolute URL in a secondary <location>. The validation routine checks only the first document available, so from time to time you might comment out the first <location> and run the validation process again to test the second one. If you share your dependent TAN file with someone who does not have a local copy of the collection, the second <location>, with the absolute URL, will point to the original copy of the document.
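
The "first available document wins" behavior can be illustrated outside of TAN. The following sketch is not the TAN validation routine; it only mimics the fallback order under stated assumptions. The function `first_available`, the `remote_ok` callback (which stands in for an actual network check), and the example.org URL are all hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def first_available(hrefs, base_dir, remote_ok=lambda url: True):
    """Return the first href that can be reached, or None.

    hrefs     -- location values in document order (relative URL first,
                 absolute URL second, as recommended above)
    base_dir  -- directory of the dependent file, used to resolve
                 relative hrefs
    remote_ok -- callable deciding whether an http(s) URL is reachable;
                 the default assumes it is, so no network access occurs
    """
    for href in hrefs:
        if urlparse(href).scheme in ("http", "https"):
            if remote_ok(href):
                return href
        elif (Path(base_dir) / href).is_file():
            return href
    return None


locations = ["bible/example.xml", "http://example.org/bible/example.xml"]
# With no local copy under the base directory, the absolute URL is chosen.
print(first_available(locations, "no-such-directory"))
```

Because the first match ends the search, a broken absolute URL goes unnoticed for as long as the local copy exists, which is why it is worth occasionally testing the second location on its own.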

In a given project, you are likely to repeat basic information, particularly <person>, <role>, and <work>. If you find yourself repeating elements such as these, which follow the IRI + name pattern (see the section called “IRI + name Pattern”), consider moving them to a TAN-key file. It is almost always preferable to develop TAN-key files before resorting to <inclusion>s, because sorting out lines of inclusion can be confusing.