When documents and files are uploaded to ETA, they are processed according to an ingestion configuration. The ingestion configuration controls what happens to documents, files, URLs, and their textual content when they are uploaded.
To configure the ingestion process:
OCR (Optical Character Recognition, a method of converting images of typed, printed or handwritten text into machine-readable text) enables you to extract text from images, for example PDFs that were created from scanned images of a physical document. OCR processing of such files or images generally gives a much better result than ingesting them ‘raw’. The OCR server returns the scanned document as HTML for ingestion by ETA. However, OCR processing is not fast: it typically takes a few seconds per image to extract the text.
The following options can be set for OCR processing:
This option allows you to ingest files with a specific content type as a plain text source. For example, when the text/html content type is added, ingested web pages display their full raw HTML markup rather than the parsed HTML content.
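The effect of this option can be illustrated with a minimal sketch. It is not ETA's implementation: the `PLAIN_TEXT_CONTENT_TYPES` set and the `ingest_mode` function are hypothetical names, and the content type is guessed from the file name with Python's standard `mimetypes` module purely for illustration.

```python
import mimetypes

# Hypothetical configuration: content types to ingest as a plain text source.
PLAIN_TEXT_CONTENT_TYPES = {"text/html"}


def ingest_mode(filename: str) -> str:
    """Return 'plain_text' if the file's guessed content type is configured
    for plain-text ingestion, otherwise 'parsed'."""
    content_type, _encoding = mimetypes.guess_type(filename)
    if content_type in PLAIN_TEXT_CONTENT_TYPES:
        return "plain_text"
    return "parsed"


print(ingest_mode("page.html"))   # markup would be kept as raw text
print(ingest_mode("report.pdf"))  # handled by the normal parser
```

With text/html in the configured set, a web page is treated as raw markup, while any content type that is not listed continues through normal parsing.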
Below are a few examples of content types, and the types of files that normally use them:
When HTML Cleaning is enabled, hidden elements and elements unrelated to the content of a web page (such as unwanted social media links, ads and navigation links) are automatically detected and removed. This is particularly useful when you want to extract only the content of a news or blog article, for example.
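As a rough illustration of the idea, the sketch below drops the text inside a fixed set of non-content elements using Python's standard `html.parser`. The `SKIPPED_TAGS` set, the `ContentExtractor` class and the `extract_content` function are assumptions made for this example; the detection ETA actually performs is automatic and is likely more sophisticated than a fixed tag list.

```python
from html.parser import HTMLParser

# Assumed list of non-content elements, chosen for illustration only.
SKIPPED_TAGS = {"nav", "aside", "footer", "script", "style"}


class ContentExtractor(HTMLParser):
    """Collects text that is not nested inside any skipped element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.parts.append(text)


def extract_content(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<article><h1>Title</h1><p>Body text.</p></article>"
    "<footer>Share this page</footer></body></html>"
)
print(extract_content(page))  # prints "Title Body text."
```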
This generates additional document content as text blocks at the beginning of an ingested document's content. The additional text can be derived from a file's metadata, which is normally stored under the document's properties.
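A minimal sketch of this behaviour, assuming the metadata is available as a simple name-to-value mapping. The `prepend_metadata` function, the field names and the "Name: value" layout are hypothetical, not ETA's actual output format.

```python
def prepend_metadata(content: str, metadata: dict[str, str], fields: list[str]) -> str:
    """Build one text block per selected metadata field that has a value,
    and place those blocks before the document's own content."""
    blocks = [f"{name}: {metadata[name]}" for name in fields if metadata.get(name)]
    return "\n".join(blocks + [content])


properties = {"Title": "Annual Report", "Author": "A. Smith"}
print(prepend_metadata("Body of the document.", properties, ["Title", "Author", "Subject"]))
```

Fields that are missing from the metadata (here, "Subject") are simply skipped, so the document's content is never preceded by empty blocks.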
Enabling this setting ensures that only one copy of each set of duplicate documents is ingested.
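One common way to detect duplicates is to compare a hash of each document's bytes, as in the sketch below. This is an assumed technique for illustration; how ETA decides that two documents are duplicates is not described here, and `ingest_unique` is a hypothetical name.

```python
import hashlib


def ingest_unique(documents: list[bytes]) -> list[bytes]:
    """Keep the first occurrence of each document, comparing by SHA-256 digest."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc).hexdigest()
        if digest in seen:
            continue  # an identical document was already ingested
        seen.add(digest)
        unique.append(doc)
    return unique


uploads = [b"report v1", b"invoice", b"report v1"]
print(len(ingest_unique(uploads)))  # prints 2
```

A byte-level hash only catches exact copies; documents that differ in formatting but share the same text would need a comparison on extracted content instead.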
This setting enables you to store source documents, images and the archive manifests in ETA.
The following options can be defined for each source file type:
Enabling this setting will create a manifest for archive files if any are detected and ingested.
Ingestion rules decide what should be done with a document based on its characteristics. There can be more than one rule, and the order of the rules is important, because the first rule that matches a document is the one that is applied.
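The first-match behaviour described above can be sketched as follows. The `Document` and `Rule` classes, the `apply_rules` function and the action labels ("skip", "ocr") are hypothetical and exist only to show why rule order matters.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Document:
    filename: str
    content_type: str
    size_bytes: int


@dataclass
class Rule:
    name: str
    matches: Callable[[Document], bool]  # predicate on document characteristics
    action: str                          # hypothetical action label


def apply_rules(doc: Document, rules: list[Rule]) -> Optional[str]:
    """Return the action of the first rule that matches, or None if none match."""
    for rule in rules:
        if rule.matches(doc):
            return rule.action
    return None


rules = [
    Rule("skip large files", lambda d: d.size_bytes > 10_000_000, "skip"),
    Rule("OCR for PDFs", lambda d: d.content_type == "application/pdf", "ocr"),
]

big_pdf = Document("scan.pdf", "application/pdf", 25_000_000)
print(apply_rules(big_pdf, rules))  # prints "skip"
```

The large PDF matches both rules, but only the first one is applied. Swapping the two rules would send the same document to OCR instead, which is why the order should be reviewed whenever a rule is added.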
The stages are: