Improving your results

ETA’s main objective is to mark up documents using a specific Document Processing configuration, with excellent results.

In a perfect world, the Document Processing configuration would have no errors or omissions, ETA would never change because it would always be optimal, and no new document formats would ever be introduced.

In the real world, however, we live with imperfections and change, so we seek to be able to demonstrate:

We use ETA’s tools to generate the metrics by which we can assess the effects of any change, and to demonstrate the scale and scope of improvement.

We use some standard procedures to ensure the metrics we generate are repeatable and consistent. These include:

Reference collections

A reference collection is a collection of documents that have been:

The collection therefore represents a point-in-time snapshot of ETA’s processing of a set of known documents with a known configuration.

A collection that contains, say, 100 documents representing the spectrum of the documents of interest is a candidate for use as a reference collection. A collection size of 100 has the added convenience that every associated count is also a percentage (25 documents represent 25% of the collection).
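For collections that do not contain exactly 100 documents, the same counts can still be reported as percentages. The following is a minimal sketch of that arithmetic in Python; the function name and the group labels are hypothetical and are not part of ETA.

```python
def as_percentages(counts, collection_size):
    """Convert document counts into percentages of the whole collection."""
    if collection_size <= 0:
        raise ValueError("collection_size must be positive")
    return {name: 100.0 * n / collection_size for name, n in counts.items()}


# Hypothetical counts for a 100-document collection: each count equals its percentage.
counts_100 = {"group A": 25, "group B": 60, "group C": 15}
percent_100 = as_percentages(counts_100, collection_size=100)

# The same proportions in an 80-document collection need the conversion.
counts_80 = {"group A": 20, "group B": 48, "group C": 12}
percent_80 = as_percentages(counts_80, collection_size=80)

print(percent_100)
print(percent_80)
```

With a collection of 100 documents the conversion is a no-op, which is why that size is convenient for a reference collection.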

Use cases

The abstract use case for a reference collection is:

More definitive use cases include:


Triage

Triage is an assessment protocol that seeks to divide a mixture of problems into three broad tranches.

In ETA’s case we often use the tranche names Very Good, Good and Difficult when performing an assessment. The absolute meaning of the names varies with context, but:

There is a fourth tranche in common use, No Matching Information, for documents that do not include any text that should have been marked up. Technically, you could argue that these documents belong in the Very Good tranche; however, it is useful to make the distinction.


Within the Triage tranches Good and Difficult, it is useful to further categorise each problem by cause. This makes it possible to group changes by angle of attack and to prioritise them by expected benefit.

Common cause names in use include:

A document may be assessed as having one or many problem causes; a cause tag is added to the document for each cause that applies.
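Because one document can carry several cause tags, per-cause counts do not sum to the number of documents, whereas counts of tag combinations do. The sketch below illustrates this with plain Python data. The document ids and the dictionary layout are hypothetical stand-ins for tags you would read from ETA's search results, and the two cause names are used purely as examples.

```python
from collections import Counter

# Hypothetical assessment data: document id -> set of cause tags on that document.
doc_causes = {
    "doc-001": {"syntax"},
    "doc-002": {"syntax", "missing tag"},
    "doc-003": {"missing tag"},
    "doc-004": {"syntax", "missing tag"},
}

# Count per cause: a document with two causes is counted twice.
per_cause = Counter(tag for tags in doc_causes.values() for tag in tags)

# Count per combination: every document is counted exactly once.
per_combination = Counter(frozenset(tags) for tags in doc_causes.values())

print("per cause:", dict(per_cause))
for combo, n in per_combination.items():
    print(sorted(combo), n)
```

Here the per-cause counts total more than the four documents, while the combination counts total exactly four, so the combination counts are the ones to reconcile against the tranche counts.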

Document assessment

This is usually performed on the Search screen using Modify Document Tags to add document assessment tags to each document. Of course, the collection must first have been marked up!

An example step-by-step process for document assessment:

  1. Select a tag category to assess.
  2. Perform a search on that tag category. This displays the result set with the tags highlighted. For each document in the result set, apply a triage tag and cause tags as appropriate.
  3. Perform the inverse search on that tag category. This result set contains all of the documents that do not have any instance of the tags. Apply the No Matching Information triage tag as appropriate. The balance of the documents in this result set have syntax or missing tag as the cause tag, and Good or Difficult as the triage tag.
  4. Perform a search on the triage tags. The document count in the result set should match the collection count; otherwise, you have missed tagging some documents!
  5. Perform a search on the four tranches of triage tags. Use the search results count to set your tranche counts.
  6. Perform a search on the cause tags by name. Use the search results to set your cause counts. Because these are not one-to-one with documents, you will need to count combinations to get the numbers to add up!
  7. Repeat for each tag category that needs assessment.
  8. Use text reference evaluation comparing the collection to itself. This gives you the tag counts for each of the selected tag categories.
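The completeness check in step 4 and the tranche counts in step 5 can also be cross-checked outside ETA if you keep a record of the tags on each document. The following is a minimal sketch under stated assumptions: the `doc_tags` dictionary, the document ids and the `tranche_profile` function are hypothetical, and the sketch assumes that each document should carry exactly one triage tag per tag category being assessed.

```python
from collections import Counter

# The four triage tranches named in this section.
TRIAGE_TAGS = {"Very Good", "Good", "Difficult", "No Matching Information"}


def tranche_profile(doc_tags):
    """Return (tranche counts, sorted ids of documents without exactly one triage tag)."""
    counts = Counter()
    problems = []
    for doc_id, tags in doc_tags.items():
        triage = tags & TRIAGE_TAGS
        if len(triage) == 1:
            counts[next(iter(triage))] += 1
        else:
            # Either no triage tag was applied or more than one was (assumed to be an error).
            problems.append(doc_id)
    return counts, sorted(problems)


# Hypothetical record of assessment tags: document id -> all tags on that document.
doc_tags = {
    "doc-001": {"Very Good"},
    "doc-002": {"Good", "syntax"},
    "doc-003": {"Difficult", "syntax", "missing tag"},
    "doc-004": {"No Matching Information"},
    "doc-005": {"missing tag"},  # triage tag forgotten
}

counts, problems = tranche_profile(doc_tags)
print("tranche counts:", dict(counts))
print("documents to revisit:", problems)
print("all documents triaged:", sum(counts.values()) == len(doc_tags))
```

When the list of documents to revisit is empty, the tranche counts sum to the collection count, which is the condition step 4 asks you to confirm.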

At this point you have a collection with an accompanying assessment profile. You might want to export the collection and the Document Processing configuration as a project for safekeeping.