Improving your results
ETA’s main objective is to mark up documents using a specific Document Processing configuration, with excellent results.
In a perfect world, the Document Processing configuration would have no errors or omissions, ETA would never change because it would always be optimal, and no new document formats would ever be introduced.
In the real world, however, we live with imperfections and change, so we seek to demonstrate that:
- each change to a Document Processing configuration provides better results without introducing side effects
- each upgrade to ETA provides at least the same results as the pre-upgrade ETA, again without introducing side effects
- new types of documents or formats can be assimilated into ETA, again without introducing side effects
We use ETA’s tools to generate the metrics by which we can assess the effects of any change, and to demonstrate the scale and scope of improvement.
We use some standard procedures to ensure the metrics we generate are repeatable and consistent. These include:
- the generation and use of reference collections
- document assessment based on tag category
- document assessment protocols based primarily on Triage and secondarily on Cause
- preservation of Document Assessment Tags across reprocessing
A reference collection is a collection of documents that have been:
- ingested with the source stored
- processed by a specific version of the appropriate Document Processing configuration
- marked up with document tags representing the per-document assessment
- exported as a project [collection + source + Document Processing configuration]
The collection therefore represents a point-in-time snapshot of ETA’s processing of a set of known documents with a known configuration.
A collection of, say, 100 documents that represent the spectrum of the documents of interest is a candidate for use as a reference collection. A collection size of exactly 100 is convenient because every associated count is also a percentage (25 documents represent 25% of the collection).
The abstract use case for a reference collection is:
- import the collection project to ETA
- copy the collection to a new collection in full, then reprocess the new collection using the same document processing configuration
- use text reference evaluation to compare the reference collection to its copy.
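The comparison in the last step can be pictured as a set difference between the text references in the two collections. The following is a minimal sketch of that idea only, not ETA's text reference evaluation itself. The `TextReference` record layout, the document identifiers and the tag category names are all hypothetical.

```python
from typing import NamedTuple


class TextReference(NamedTuple):
    # Hypothetical record layout; ETA's own export format may differ.
    document_id: str
    tag_category: str
    start: int  # character offset where the marked-up text begins
    end: int    # character offset where the marked-up text ends


def compare_collections(reference, copy):
    """Split text references into those lost, gained and unchanged by reprocessing."""
    ref_set, copy_set = set(reference), set(copy)
    return {
        "lost": sorted(ref_set - copy_set),       # only in the reference collection
        "gained": sorted(copy_set - ref_set),     # only in the reprocessed copy
        "unchanged": sorted(ref_set & copy_set),  # identical in both
    }


# Illustrative data: reprocessing moved the end of one Amount reference.
reference = [
    TextReference("doc-001", "Date", 10, 20),
    TextReference("doc-001", "Amount", 35, 42),
    TextReference("doc-002", "Date", 5, 15),
]
copy = [
    TextReference("doc-001", "Date", 10, 20),
    TextReference("doc-001", "Amount", 35, 44),
    TextReference("doc-002", "Date", 5, 15),
]
result = compare_collections(reference, copy)
```

If the copy was reprocessed with the same configuration on the same version of ETA, you would expect `lost` and `gained` to be empty. Any entry in either list is a difference worth investigating.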
More concrete use cases include:
- performing an impact analysis of introducing a new version of ETA
- assessing and refining incremental changes to the Document Processing configuration
  - in this case, you do not reuse the original Document Processing configuration, but rather use the modified one!
  - make changes to the Document Processing configuration for one tag category at a time - this makes any side effects apparent in the evaluation
  - the text reference evaluation forms part of the metrics by which you can judge improvement
- reporting on progress of incremental changes to Document Processing configuration
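When you change the configuration for one tag category at a time, a side effect shows up as movement in a category you did not touch. The sketch below illustrates that check on per-category tag counts. It assumes you have copied the counts out of two evaluations by hand; the function name, the category names and the numbers are hypothetical.

```python
def side_effects(before, after, changed_category):
    """Return the categories, other than the one changed, whose tag counts moved."""
    categories = set(before) | set(after)
    return {
        category: (before.get(category, 0), after.get(category, 0))
        for category in categories
        if category != changed_category
        and before.get(category, 0) != after.get(category, 0)
    }


# Illustrative tag counts per category, before and after a change aimed at Date.
before = {"Date": 120, "Amount": 85, "Party": 60}
after = {"Date": 131, "Amount": 85, "Party": 58}
unexpected = side_effects(before, after, changed_category="Date")
```

Here `unexpected` contains only the Party category, with its counts before and after. Note that an unchanged count does not prove that nothing changed, because tags can move between documents while the total stays the same, so treat this as a first filter rather than a replacement for the evaluation.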
Triage is an assessment protocol that seeks to divide a mixture of problems into three broad tranches.
In ETA’s case we often use the tranche names Very Good, Good and Difficult when performing an assessment. The absolute meaning of the names varies with context, but:
- Very Good generally implies optimal
- Good implies acceptable, with the added implication that improving to Very Good will be cost-effective
- Difficult implies a poor result that may be difficult to improve
There is a fourth tranche in common use, for documents that do not include any text that should have been marked up - this is the No Matching Information tranche. Technically you could argue that these documents belong in the Very Good tranche; however, it is useful to make the distinction.
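As an illustration, the four tranches can be modelled as an enumeration and tallied into counts and percentages. This is a sketch under assumptions: the `Triage` class, the `tranche_profile` function and the document identifiers are hypothetical and are not part of ETA.

```python
from collections import Counter
from enum import Enum


class Triage(Enum):
    VERY_GOOD = "Very Good"
    GOOD = "Good"
    DIFFICULT = "Difficult"
    NO_MATCHING_INFORMATION = "No Matching Information"


def tranche_profile(assessments):
    """Map each tranche name to (document count, percentage of the collection)."""
    if not assessments:
        raise ValueError("the collection has no assessed documents")
    counts = Counter(assessments.values())
    total = len(assessments)
    return {t.value: (counts[t], 100 * counts[t] / total) for t in Triage}


# Hypothetical assessments for one tag category, keyed by document identifier.
assessments = {
    "doc-001": Triage.VERY_GOOD,
    "doc-002": Triage.VERY_GOOD,
    "doc-003": Triage.GOOD,
    "doc-004": Triage.DIFFICULT,
    "doc-005": Triage.NO_MATCHING_INFORMATION,
}
profile = tranche_profile(assessments)
```

Keeping No Matching Information as its own member, rather than folding it into Very Good, preserves the distinction described above in the resulting profile.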
Within the Triage tranches Good and Difficult, it is useful to further categorise the problem by cause - this makes it possible to group changes based on angle of attack and prioritise changes based on expected benefit.
Common cause names in use include:
- Missing Tag - a text reference has not been correctly identified
- Span Error - a text reference has been identified, but does not include all of the text that it should
- Spurious Value - a text reference has been identified where it should not have been
- Syntax - a text reference has not been identified due to spelling, grammatical or format issues
A document may be assessed as having one or many problem causes. Add each applicable cause tag to the document.
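Because one document can carry several cause tags, the per-cause counts can add up to more than the number of documents. Counting the combinations of causes gives totals that do match the document count. The sketch below shows both tallies on hypothetical data; the document identifiers are invented for illustration.

```python
from collections import Counter

# Hypothetical cause tags per assessed document.
doc_causes = {
    "doc-001": {"Missing Tag"},
    "doc-002": {"Span Error", "Spurious Value"},
    "doc-003": {"Span Error"},
    "doc-004": {"Span Error", "Spurious Value"},
}

# Tally per cause: a document with two causes is counted twice.
per_cause = Counter(cause for causes in doc_causes.values() for cause in causes)

# Tally per combination: each document is counted exactly once.
per_combination = Counter(frozenset(causes) for causes in doc_causes.values())
```

In this example the per-cause counts total six across four documents, while the per-combination counts total four.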
Document assessment is usually performed on the Search screen, using Modify Document Tags to add document assessment tags to each document. Of course, the collection must first have been marked up!
An example step-by-step process for document assessment:
- Select a tag category to assess.
- Perform a search on that tag category. This displays the result set with the tags highlighted. For each document in the result set, apply a triage tag and cause tags as appropriate.
- Perform the inverse search on that tag category. This result set contains all of the documents that do not have any instance of the tags. Apply the No Matching Information triage tag to the documents that genuinely contain nothing to mark up. The remaining documents in this result set take Syntax or Missing Tag as the cause tag, and Good or Difficult as the triage tag.
- Perform a search on the triage tags. The document count in the result set should match the collection count, otherwise you have missed tagging some documents!
- Perform a search on the four tranches of triage tags. Use the search results count to set your tranche counts.
- Perform a search on the cause tags by name. Use the search results to set your cause counts. Because these are not one-to-one with documents, you will need to count combinations to get the numbers to add up!
- Repeat for each tag category that needs assessment.
- Use text reference evaluation comparing the collection to itself. This gives you the tag counts for each of the selected tag categories.
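The completeness check in the steps above (the triage result count should match the collection count) can be sketched as follows for a single tag category. The sketch assumes the document tags have been transcribed into a dictionary by hand, and it assumes each document should carry exactly one triage tag per tag category, which the steps imply but do not state outright. All names are hypothetical.

```python
TRIAGE_TAGS = {"Very Good", "Good", "Difficult", "No Matching Information"}


def triage_problems(document_tags):
    """Return (documents with no triage tag, documents with more than one)."""
    untagged, multiple = [], []
    for doc_id, tags in document_tags.items():
        triage_count = len(TRIAGE_TAGS & set(tags))
        if triage_count == 0:
            untagged.append(doc_id)
        elif triage_count > 1:
            multiple.append(doc_id)
    return sorted(untagged), sorted(multiple)


# Hypothetical document tags for one tag category.
document_tags = {
    "doc-001": ["Very Good"],
    "doc-002": ["Good", "Span Error"],
    "doc-003": ["Missing Tag"],        # cause tag but no triage tag
    "doc-004": ["Good", "Difficult"],  # two triage tags
}
untagged, multiple = triage_problems(document_tags)
```

Documents listed in `untagged` are the ones that cause the triage search count to fall short of the collection count.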
At this point you have a collection with an accompanying assessment profile. You might want to export the collection and the Document Processing configuration as a project for safekeeping.