ETA's document processing workflow

The Document Processing workflow is a sequence of modules that converts documents sent to the system (using Server Library, fetched by URL or uploaded by users or services) into an internal ETA representation, which can then be saved to a document collection or returned to the user or service. A collection is a container for storing and organising ingested files and documents. Only the textual content is stored in collections, not the original files and documents.

Figure 1: Document processing workflow

The Normalisation module reads the original document stream and produces ETA documents based on its contents.

The Filtering module executes the Filtering configuration of the Document Processing Configuration and discards documents based on their file extension, content-type, language or MD5 checksum (deNISTing).
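The four filtering criteria can be illustrated with a minimal Python sketch. This is not ETA's implementation or configuration format: the `Doc` and `FilterConfig` types, their field names and the `should_discard` function are all hypothetical, and the language is assumed to have been detected already.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # Hypothetical stand-in for a normalised document.
    name: str
    content_type: str
    language: str
    data: bytes


@dataclass(frozen=True)
class FilterConfig:
    # Hypothetical filter settings; not ETA's real configuration schema.
    blocked_extensions: frozenset = frozenset()
    blocked_content_types: frozenset = frozenset()
    blocked_languages: frozenset = frozenset()
    known_md5: frozenset = frozenset()  # hex digests of files to drop


def should_discard(doc: Doc, cfg: FilterConfig) -> bool:
    """Return True if any of the four criteria matches."""
    # File extension, lower-cased; empty string when the name has no dot.
    ext = doc.name.rsplit(".", 1)[-1].lower() if "." in doc.name else ""
    if ext in cfg.blocked_extensions:
        return True
    if doc.content_type.lower() in cfg.blocked_content_types:
        return True
    if doc.language.lower() in cfg.blocked_languages:
        return True
    # Checksum match against a list of known files.
    return hashlib.md5(doc.data).hexdigest() in cfg.known_md5
```

A document is discarded as soon as one criterion matches. The MD5 comparison runs last because it is the only check that has to read the full document content.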

The Classification and Tagging module executes the learned Classification and Tagging configurations defined in the Document Processing Configuration.

The Entity Extraction module executes the Entity Extraction Configuration specified in the Document Processing Configuration. It has an internal workflow, which is accessible from the Entity Extraction Configuration's test page as well as from the Entity Extraction Script editor.

Document Processing scripts and Document Entity Extraction scripts are executed as part of the Entity Extraction module.
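The module sequence described above can be sketched as a simple chain of stages. The sketch assumes the modules run in the order listed. Every function name and the dictionary-based document shape are hypothetical, and the stage bodies are toy placeholders rather than ETA's actual logic.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Stage = Callable[[List[Document]], List[Document]]


def normalise(raw: bytes) -> List[Document]:
    # Placeholder normalisation: one document per non-empty line.
    text = raw.decode("utf-8", errors="replace")
    return [
        {"text": line, "tags": [], "entities": []}
        for line in text.splitlines()
        if line.strip()
    ]


def filter_stage(docs: List[Document]) -> List[Document]:
    # Placeholder filter: discard documents that start with '#'.
    return [d for d in docs if not str(d["text"]).startswith("#")]


def classify_and_tag(docs: List[Document]) -> List[Document]:
    # Placeholder classifier: tag each document by text length.
    for d in docs:
        d["tags"] = ["long"] if len(str(d["text"])) > 20 else ["short"]
    return docs


def extract_entities(docs: List[Document]) -> List[Document]:
    # Placeholder extraction: treat capitalised words as entities.
    for d in docs:
        d["entities"] = [w for w in str(d["text"]).split() if w[:1].isupper()]
    return docs


# Assumed order: filtering, classification and tagging, entity extraction.
STAGES: List[Stage] = [filter_stage, classify_and_tag, extract_entities]


def process(raw: bytes, stages: List[Stage]) -> List[Document]:
    """Normalise the raw stream, then apply each stage in turn."""
    docs = normalise(raw)
    for stage in stages:
        docs = stage(docs)
    return docs
```

For example, `process(b"# skip me\nAlice met Bob\n", STAGES)` yields a single document, because the first line is dropped by the placeholder filter before classification and extraction run.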

The Entity Extraction workflow

The Entity Extraction workflow is responsible for the majority of the analytical processing in ETA. The workflow sequence is visible in ETA as the left-hand pane of the Text Graph Analyzer and is reproduced below.

 
