Processing and reprocessing documents

Documents are automatically processed by ETA during the creation of a document collection, or whenever a document or group of documents is added to that collection. (A collection is a container for storing and organising ingested files and documents. Only the textual content is stored in collections, not the original files and documents.)

Documents that have already been processed can be reprocessed at any time using the Reprocess Documents action on the Document Collection page.

Instructions for processing documents

Document processing is first handled during the Document Collection Creation phase.

You can change the selected Ingestion Configuration for the collection from the Document Collection page.

Click the Add Documents button on the Document Collection page. This opens the Add Documents pop-up window together with the Configuration window.

From the Configuration window, select the Ingestion Configuration you require.

Instructions for reprocessing documents

In general, reprocessing first removes all existing document markup and then generates new markup according to the current version of the Document Processing Configuration.

This generally means that all manual markup in the collection will be lost. See "Preserving markup while reprocessing" below for how manual markup can be preserved.

Some reasons why you might want to reprocess all documents in a collection include:

Reprocess Documents is an action available on the Document Collection page.

Clicking on Reprocess Documents brings up the Reprocess Documents dialogue.

Change the selected Ingestion Configuration via the drop-down menu if desired.

Click the Reprocess button to reprocess the documents or the Cancel button to cancel without reprocessing.

Preserving markup while reprocessing

Certain types of markup are automatically preserved during reprocessing, namely Facts that belong to the "Native" and "Metadata" Fact Categories.

The names "Native" and "Metadata" are Fact Categories (that is, name-spaces) with protected status. Document Processing is not allowed to add, change or delete Facts that belong to these categories.

The Reprocess Documents dialogue allows you to define further Fact Categories with protected status. Be aware that not only does this preserve the Facts across reprocessing, it prevents the Document Processing Configuration from adding, changing or deleting any Fact in a protected Category.
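The protection rule described above can be pictured with a small model. The Python sketch below is purely illustrative and is not ETA code: the `Fact` class, the `reprocess` function and the `generate` callback are hypothetical names, and the sketch assumes that each Fact belongs to exactly one category.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set

# Categories that always have protected status, per the text above.
ALWAYS_PROTECTED = frozenset({"Native", "Metadata"})


@dataclass(frozen=True)
class Fact:
    """Hypothetical stand-in for one piece of markup on a document."""
    category: str
    value: str


def reprocess(
    existing: Iterable[Fact],
    generate: Callable[[], Iterable[Fact]],
    extra_protected: Iterable[str] = (),
) -> List[Fact]:
    """Model the effect of reprocessing on one document's Facts.

    `generate` stands in for running the current Document Processing
    Configuration; it returns the Facts that processing would produce.
    """
    protected: Set[str] = set(ALWAYS_PROTECTED) | set(extra_protected)

    # Facts in protected categories survive untouched; all other markup is removed.
    kept = [f for f in existing if f.category in protected]

    # Processing may not add Facts to a protected category either.
    added = [f for f in generate() if f.category not in protected]

    return kept + added


# Example: "Review" is a made-up category holding manual markup.
before = [Fact("Native", "title"), Fact("Review", "checked by hand"), Fact("Entities", "old")]
after = reprocess(
    before,
    lambda: [Fact("Entities", "new"), Fact("Review", "auto")],
    extra_protected=["Review"],
)
```

In this model the manual "Review" Fact is kept, the old "Entities" Fact is replaced, and the "Review" Fact that processing tried to generate is discarded. That last point is the trade-off of protecting a category: processing can no longer contribute to it.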

To add a Fact Category to the protected list for reprocessing, first check the ‘Exceptions’ checkbox in the Reprocess Documents dialogue.

This expands the dialogue to include a selectable list of Fact Categories for the collection (the list is assembled from all of the Fact Categories in use in the collection).

Note that the Metadata and Native categories are displayed but cannot be deselected.

Checking the box for any other Fact Category adds that category to the protected list. Your choices are saved for future reprocessing only when you click the Reprocess button. If you leave the dialogue by clicking Cancel, none of the choices are saved.
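The behaviour of the Exceptions list can be summarised in a short model. This Python sketch is an illustration only and does not reflect how ETA is implemented: the function names and the button strings are hypothetical, and the alphabetical ordering of the list is an assumption, as is showing Native and Metadata even when no Fact uses them.

```python
from typing import Iterable, List, Set

# The two categories that are always shown and cannot be deselected.
ALWAYS_PROTECTED = frozenset({"Native", "Metadata"})


def listed_categories(fact_categories_in_use: Iterable[str]) -> List[str]:
    """Categories offered in the Exceptions list.

    Built from every Fact Category in use in the collection, plus the
    two permanently protected ones. Sorting is an assumption.
    """
    return sorted(set(fact_categories_in_use) | ALWAYS_PROTECTED)


def close_dialogue(saved: Set[str], checked: Iterable[str], button: str) -> Set[str]:
    """Return the protected list stored for the collection after the dialogue closes.

    `saved` is the list stored before the dialogue was opened, `checked`
    is what the user selected, and `button` is "Reprocess" or "Cancel".
    """
    if button == "Reprocess":
        # Choices are saved only when reprocessing actually starts.
        return set(checked) | ALWAYS_PROTECTED
    # Cancel (or anything else) discards the choices made in the dialogue.
    return set(saved)
```

For example, `close_dialogue({"Native", "Metadata"}, ["Review"], "Cancel")` leaves the stored list unchanged, whereas the same call with `"Reprocess"` adds the made-up `Review` category to it.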

 
