ETA’s document model
ETA keeps documents in an internal representation that can exist either as an in-memory object or in persisted form. When the document model is persisted, it is stored in a distributed way to support the rapid, flexible search operations that are critical to ETA's performance.
The model permits storage of:
- the set of text blocks that together make up the text of the document - typically one text block per paragraph or table cell
- the document structure in the form of a tree of XHTML elements
- text references to entities within the text of the document - each of which has a span and may have other features, such as ‘first name’, ‘gender’ or ‘latitude’
- document entities, each of which records the common identity being referred to in one or more text references - recognising for example that ‘Obama’ co-refers with ‘Barack Obama’
- connections between text references, such as relationship mentions, for example between entities in ‘John is the son of Mary’
- relations, each of which groups the connections that refer to the same underlying relationship between a group of entities
- properties, which are key-value pairs that store information about any structural element, including the document itself - for example, document tags and metadata are document properties
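The components listed above can be pictured as a small set of linked records. The following Python sketch is illustrative only: every class, field and identifier name in it is hypothetical and is not taken from ETA's actual schema or export format, and the XHTML structure is held as a plain string for brevity rather than as an element tree.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TextBlock:
    block_id: str
    text: str  # typically one paragraph or table cell


@dataclass
class TextReference:
    ref_id: str
    block_id: str
    span: Tuple[int, int]  # (start, end) character offsets, end exclusive
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class DocumentEntity:
    entity_id: str
    ref_ids: List[str]  # text references that share one identity


@dataclass
class Connection:
    connection_id: str
    kind: str  # hypothetical label for the relationship mention
    ref_ids: List[str]  # the text references being connected


@dataclass
class Relation:
    relation_id: str
    connection_ids: List[str]  # connections describing the same relationship


@dataclass
class Document:
    blocks: List[TextBlock] = field(default_factory=list)
    structure: str = ""  # assumption: XHTML kept as a string, not a tree
    references: List[TextReference] = field(default_factory=list)
    entities: List[DocumentEntity] = field(default_factory=list)
    connections: List[Connection] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)
    properties: Dict[str, str] = field(default_factory=dict)

    def mentions_of(self, entity_id: str) -> List[str]:
        """Return the surface text of each reference linked to an entity."""
        texts = {b.block_id: b.text for b in self.blocks}
        refs = {r.ref_id: r for r in self.references}
        entity = next(e for e in self.entities if e.entity_id == entity_id)
        mentions = []
        for ref_id in entity.ref_ids:
            ref = refs[ref_id]
            start, end = ref.span
            mentions.append(texts[ref.block_id][start:end])
        return mentions


doc = Document(
    blocks=[
        TextBlock("b1", "Barack Obama spoke first."),
        TextBlock("b2", "Obama then left."),
        TextBlock("b3", "John is the son of Mary"),
    ],
    structure="<body><p>b1</p><p>b2</p><p>b3</p></body>",
    references=[
        TextReference("r1", "b1", (0, 12), {"first name": "Barack"}),
        TextReference("r2", "b2", (0, 5)),
        TextReference("r3", "b3", (0, 4)),
        TextReference("r4", "b3", (19, 23)),
    ],
    entities=[
        DocumentEntity("e1", ["r1", "r2"]),  # 'Obama' co-refers with 'Barack Obama'
        DocumentEntity("e2", ["r3"]),
        DocumentEntity("e3", ["r4"]),
    ],
    connections=[Connection("c1", "son of", ["r3", "r4"])],
    relations=[Relation("rel1", ["c1"])],
    properties={"tag": "example"},
)

print(doc.mentions_of("e1"))
```

In this sketch, text references point into text blocks by character span, document entities group co-referring references, and relations group connections in the same way, which mirrors the layering described in the list.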
A visual example of ETA's document model
To see examples of the document model, export documents directly from ETA, for example by clicking the download icon in the document listing.
Note: Tag trees are not inherent to documents; they are merely a way of viewing the tags and text references in documents.