Entity Extraction Scripts are written as a sequence of rules. Each rule works on a graph (which is a sequence of nodes and links - see The Text Graph) and attempts to:
Within each text block, rules execute in the order in which they are written, and the output of each rule is available to the rules that follow. If multiple EES files are used, they are applied to each text block in the sequence in which they appear in the Document Processing Configuration.
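The ordering behaviour can be illustrated outside EES with a minimal Python sketch. This is not EES syntax or the ETA engine: the tuple representation of a link and the names `rule_colour`, `rule_coloured_car` and `run_rules` are assumptions made only for illustration.

```python
# Hypothetical model: a link is (start_node, end_node, name).
TOKENS = [(0, 1, "blue"), (1, 2, "car")]

def rule_colour(links):
    """Create a 'Colour' link over every link named 'blue'."""
    return [(s, e, "Colour") for (s, e, name) in links if name == "blue"]

def rule_coloured_car(links):
    """Create a 'ColouredCar' link over a 'Colour' link followed by 'car'."""
    out = []
    for (s, mid, name) in links:
        if name != "Colour":
            continue
        for (s2, e2, name2) in links:
            if s2 == mid and name2 == "car":
                out.append((s, e2, "ColouredCar"))
    return out

def run_rules(links, rules):
    """Apply rules in written order; each rule sees all earlier output."""
    links = list(links)
    for rule in rules:
        links.extend(rule(links))
    return links

in_order = run_rules(TOKENS, [rule_colour, rule_coloured_car])
reversed_order = run_rules(TOKENS, [rule_coloured_car, rule_colour])
```

In this sketch the second rule only fires when it is written after the rule whose output it depends on, which is why the order of rules within a script matters.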
Rule matching is greedy. If, starting on an arbitrary node, the same rule can be matched in several ways, the longest is chosen.
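Greedy matching can be sketched in Python as follows. The graph layout, the `(label, optional)` pattern elements and the function names are hypothetical, and the sketch assumes node numbers increase along the text so that the highest end node corresponds to the longest match.

```python
# Hypothetical graph: node -> list of (link_label, target_node).
GRAPH = {0: [("the", 1)], 1: [("blue", 2)], 2: [("car", 3)]}

# Hypothetical pattern: each element is (label, is_optional).
PATTERN = [("the", False), ("blue", False), ("car", True)]

def match_ends(graph, node, pattern):
    """Return every node at which the pattern can end when started at node."""
    if not pattern:
        return {node}
    (label, optional), rest = pattern[0], pattern[1:]
    ends = set()
    if optional:
        # Skip the optional element and try the rest from the same node.
        ends |= match_ends(graph, node, rest)
    for link_label, target in graph.get(node, []):
        if link_label == label:
            ends |= match_ends(graph, target, rest)
    return ends

def greedy_match(graph, node, pattern):
    """Choose the longest of all possible matches, or None if there is none."""
    ends = match_ends(graph, node, pattern)
    return max(ends) if ends else None
```

Started on node 0, this pattern can end at node 2 (without "car") or node 3 (with "car"); the greedy choice is node 3.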
All white space is considered equal. Use newlines, tabs and spaces to indent and align rules for best readability.
Use C++ / Java comment style:
// this is a comment until end of line
/* this is a comment
which spans multiple lines */
Each rule has two parts: a matching pattern and an output phrase.
The basic syntax for a matching pattern matches a sequence of pattern elements on the graph. Each pattern element usually matches a link. For example, a matching pattern might have three elements:
pattern_element1
pattern_element2
pattern_element3
Links in a matching pattern can be listed across the page without changing the meaning:
pattern_element1 pattern_element2 pattern_element3
The token string "Paris is fun" is matched by the following sequence:
Pattern elements may contain pattern element conditions, which serve to make the pattern element more selective in matching graph elements:
pattern_element1<conditions1> pattern_element2<conditions2> pattern_element3<conditions3>
Conditions are expressed in relation to the features of a graph element. The conditions are contained between angle brackets ("<" and ">") and there may be several:
pattern_element<value1_left=value1_right, value2_left=value2_right, ....>
Each condition takes the form of an equality or an inequality:
where left or right values can be constants or functions.
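The way conditions narrow a pattern element can be modelled with a short Python sketch. It is an illustration only: the dictionary representation of a graph element, the feature name `text`, and the helpers `resolve`, `satisfies`, `text_of` and `length_of` are assumptions, not part of EES.

```python
import operator

def resolve(value, element):
    """A condition side is either a constant or a function of the element."""
    return value(element) if callable(value) else value

def satisfies(element, conditions):
    """True only if every (left, comparison, right) condition holds."""
    return all(compare(resolve(left, element), resolve(right, element))
               for left, compare, right in conditions)

def text_of(element):
    return element["text"]

def length_of(element):
    return len(element["text"])

# One equality and one inequality, mirroring the two condition forms.
CONDITIONS = [
    (text_of, operator.eq, "Paris"),
    (length_of, operator.ne, 0),
]
```

An element is accepted only when all of its conditions hold, so adding conditions can only make a pattern element more selective.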
The most common output phrase creates a new link over the matched sequence.
The syntax for link creation is:
value_left=> new_link_name<feature1=value1, feature2=value2...>
The features within the angle brackets ("<>") are then added to the newly created link.
Let's look at a full EES rule for this:
The ETA graph viewer (above) shows that the link "exercise:Exclamation" has been created over the text "hello, world!" with the feature "welcome" set to "true".
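The effect of link creation described above can be modelled in Python. This is a sketch under assumptions: the `Link` class, the `create_link` helper and the four-token split of "hello, world!" are hypothetical and only mirror the described result, not the EES rule itself.

```python
from dataclasses import dataclass, field

@dataclass
class Link:
    name: str
    start: int
    end: int
    features: dict = field(default_factory=dict)

def create_link(matched, name, **features):
    """Create one new link spanning the whole matched sequence of links."""
    return Link(name, matched[0].start, matched[-1].end, dict(features))

# Assumed tokenisation of "hello, world!" into four consecutive links.
matched = [
    Link("hello", 0, 1),
    Link(",", 1, 2),
    Link("world", 2, 3),
    Link("!", 3, 4),
]

new_link = create_link(matched, "exercise:Exclamation", welcome="true")
```

The new link starts where the first matched link starts, ends where the last one ends, and carries the features listed in the output phrase.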
Note: If you want to see the output of your rules, use the "tag:" namespace, for example:
To create a visible text reference, use the namespace "tag" with the output phrase.
=> tag:tag_name<...> // where "..." are feature settings
We have some text below:
The blue car, which was leaking oil, drove west.
We apply the following matching pattern:
This rule will fire on the text segment "blue car" and generate a text reference with tag name "Blue_car", as shown below.
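The outcome described for this example can be reproduced with a small Python sketch. It is not the EES rule itself: the regular-expression tokeniser, the dictionary layout of a text reference and the function name `tag_blue_car` are assumptions made for illustration.

```python
import re

TEXT = "The blue car, which was leaking oil, drove west."

def tag_blue_car(text):
    """Return a 'Blue_car' reference for each adjacent 'blue' 'car' pair."""
    # Assumed tokenisation: runs of word characters, or single punctuation marks.
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+|[^\w\s]", text)]
    refs = []
    for (word1, start1, _), (word2, _, end2) in zip(tokens, tokens[1:]):
        if word1.lower() == "blue" and word2.lower() == "car":
            refs.append({"tag": "Blue_car", "start": start1, "end": end2,
                         "text": text[start1:end2]})
    return refs

refs = tag_blue_car(TEXT)
```

Only the two adjacent tokens are covered by the reference; the rest of the sentence is left untouched.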
Entity Extraction Scripts can be inserted into the ETA workflow so that they can be used to process documents in bulk.
You can insert your entity extraction scripts at either of two stages in the ETA workflow: Early or Late. Built-in learned entities (with link types such as tag:Person, tag:Organisation and tag:Location) are only available for use in matching patterns when the script is inserted for Late Stage processing.
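The difference between the two stages can be pictured with a Python sketch. It assumes, as the availability of learned entities suggests, that the built-in learned entity step runs between the Early and Late scripts; `run_workflow`, `fake_learned_step`, `needs_person` and the link name `exercise:PersonSeen` are hypothetical and do not describe the actual ETA implementation.

```python
def run_workflow(links, early_scripts, late_scripts, learned_entity_step):
    """Run Early scripts, then the learned entity step, then Late scripts."""
    links = list(links)
    for script in early_scripts:
        links = links + script(links)
    links = links + learned_entity_step(links)
    for script in late_scripts:
        links = links + script(links)
    return links

def fake_learned_step(links):
    # Stand-in for the built-in learned entities (assumption).
    return ["tag:Person"]

def needs_person(links):
    # A script whose pattern depends on a learned entity link.
    return ["exercise:PersonSeen"] if "tag:Person" in links else []

as_early = run_workflow(["token"], [needs_person], [], fake_learned_step)
as_late = run_workflow(["token"], [], [needs_person], fake_learned_step)
```

In this model the same script produces output only when it runs at the Late stage, because the link it matches on does not exist yet at the Early stage.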
EESs are inserted into the ETA workflow via the Document Processing Configuration, which can be accessed via the Configuration menu or directly from the Text Graph Analyzer. The image below shows the document processing configuration needed to run the Example EES.
EESs are run over documents in the order in which they occur (from top to bottom) in the tables below.
Note: Acronym detection is enabled only for text references created by an Entity Extraction Script added at the Early stage. Acronyms are not identified for text references created by a Late script, so a Late script must handle acronym detection itself.