Entity Extraction Scripts provide several advanced capabilities that make it easier to write concise, fast code with controlled dependencies.
Instead of repeating a common set of matching rules you can define the pattern once and then form an alias. This alias can then be used for matching.
//Create an alias GREETING:
#alias GREETING =
Token<text() ="welcome"> #
//this can then be used in several rules for example:
In projects, you frequently want to create a document processing configuration that works well across a range of document types. Often each document type requires a different EES file. Unfortunately, the EES rules intended for one document type may fire when processing another, causing errors or misses. To avoid this problem and to keep the task of writing EESs as simple as possible, you can use ETA's built-in document classifier to classify the different document types and give them identifying tags. These tags can then be used to trigger individual entity extraction scripts.
To make the operation of an EES on a document conditional on the presence of a document tag (for example, "MyDocType" in category "DocTypes"), insert the following command before the first rule in the entity extraction script:
#cond document.tag<category="DocTypes", name="MyDocType"> #
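The gating behaviour described above can be pictured with a small Python model. This is only an illustrative sketch, not Sintelix code: the Document and Script classes, the scripts_to_run function and the file names are hypothetical, and the only thing assumed about the product is that a script with a #cond line runs only when the document carries the named tag.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class Document:
    text: str
    # (category, name) pairs assigned by a document classifier (hypothetical model)
    tags: set = field(default_factory=set)

@dataclass
class Script:
    name: str
    # Required (category, name) tag, or None if the script has no #cond line
    condition: Optional[Tuple[str, str]] = None

def scripts_to_run(doc, scripts):
    """Return the names of the scripts whose tag condition is met by doc."""
    return [s.name for s in scripts
            if s.condition is None or s.condition in doc.tags]

doc = Document("Some text", tags={("DocTypes", "MyDocType")})
scripts = [
    Script("my_doc_type.ees", ("DocTypes", "MyDocType")),
    Script("other_doc_type.ees", ("DocTypes", "OtherDocType")),
    Script("common.ees"),  # unconditional
]
selected = scripts_to_run(doc, scripts)
print(selected)
```

In this model the script written for "OtherDocType" is never applied to the document, which is the protection against stray rule firings that the tag condition is meant to give.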
It may be that you have rules in an EES that should only be applied in a specific context, like a particular section of a document.
For example, consider a document with an Executive Summary at the beginning, followed by an Introduction and other normal document sections. In this example, the document sections are clearly labelled and a simple dictionary has been used to insert text references with name "section_marker" and feature "key = [document section]". The key has values like "executive_summary", "introduction" etc.
The example below shows the use of #section to apply a rule in the EES.
//Create a section called EXEC that only executes in the Executive Summary context:
#section EXEC = tag:section_marker<key = "executive_summary">, tag:section_marker<key != "executive_summary"> #
#sectionend EXEC #
The effect of this snippet would be to tag every Token in the executive summary with tag "InSection".
Syntax for the #section label is:
#section section-name = matching-pattern1, matching-pattern2 #
Matching-pattern1 and matching-pattern2 follow the EES syntax for Matching Patterns.
Note: The dash symbol may be substituted for either matching-pattern. It acts as a wildcard with a meaning of "any".
#section BOD = - , tag:section_marker #
Replacing matching-pattern1 with dash means that section BOD applies from the beginning of the document to the first occurrence of tag section_marker.
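To make the section boundaries concrete, the following Python sketch computes the character spans that a section would cover, given the positions of the section_marker references. It is a simplified model under stated assumptions, not the engine's implementation: section_spans is a hypothetical helper, markers are reduced to (offset, key) pairs, a marker that closes a section does not also reopen one, and a dash in place of matching-pattern2 is assumed to run to the end of the document.

```python
def section_spans(markers, starts, ends, doc_len):
    """Compute (start, end) spans for a section.

    markers: list of (offset, key) pairs sorted by offset.
    starts / ends: predicate on the marker key, or None for the dash wildcard.
    """
    spans = []
    # A dash as matching-pattern1 opens the section at the start of the document.
    open_at = 0 if starts is None else None
    for offset, key in markers:
        if open_at is not None:
            if ends is not None and ends(key):
                spans.append((open_at, offset))
                open_at = None
        elif starts is not None and starts(key):
            open_at = offset
    if open_at is not None:
        # Assumption: an unclosed section extends to the end of the document.
        spans.append((open_at, doc_len))
    return spans

markers = [(0, "executive_summary"), (120, "introduction"), (300, "methods")]

# Model of: #section EXEC = tag:section_marker<key = "executive_summary">,
#                           tag:section_marker<key != "executive_summary"> #
exec_spans = section_spans(
    markers,
    starts=lambda k: k == "executive_summary",
    ends=lambda k: k != "executive_summary",
    doc_len=500,
)

# Model of: #section BOD = - , tag:section_marker #
bod_spans = section_spans(
    [(40, "introduction"), (300, "methods")],
    starts=None,
    ends=lambda k: True,
    doc_len=500,
)
print(exec_spans, bod_spans)
```

Here EXEC covers the text from the executive summary marker up to the next marker with a different key, and BOD covers the text from the beginning of the document to the first marker.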
Sections may be nested.
Syntax for the #sectionend label is:
#sectionend section-name #
where section-name matches the corresponding section label.
The use of the #sectionend label is optional. If omitted, all of the rules to the end of the EES are considered to be part of the section. The recommended practice is to always use a #sectionend label.
Sintelix’s EES rule engine runs very fast, but it is still possible to write rules that are very slow to execute.
To make rules run fast, use the rarest and most specific pattern elements in matching patterns.
If a text graph (and therefore a text block) doesn't contain a pattern element required by a rule, the entire rule is skipped for that graph.
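The skipping behaviour can be modelled as a simple set test. The Python sketch below is illustrative only: it assumes each text graph can be summarised as the set of pattern element types it contains, and the names graphs, rule_requires and rule_can_fire are hypothetical.

```python
def rule_can_fire(required_elements, graph_elements):
    """A rule is only worth running if the graph has every element it requires."""
    return required_elements <= graph_elements

# Each text graph is summarised by the element types present in it (toy data).
graphs = [
    {"Token", "Token.punctuation.exclamation"},
    {"Token"},
    {"Token", "Token.number"},
]

rule_requires = {"Token.punctuation.exclamation"}

# Indices of the graphs on which the rule is actually run.
runnable = [i for i, g in enumerate(graphs) if rule_can_fire(rule_requires, g)]
print(runnable)
```

A rule that requires a rare element is skipped on most graphs, while a rule that requires only Token can never be skipped this way because every graph contains tokens.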
You could match an exclamation mark (!) with either of the rules below.
The first rule is a slow rule because every node contains a token. This rule requires that each token is tested to see if its text is an exclamation mark.
The second rule is faster because Token.punctuation.exclamation is much rarer and the number of times the rule is run is therefore drastically reduced.
The first link of any sequence is the most important. You should try not to start sequences with very common pattern elements: choose the rarest first (if you can) and then work along to the most common.
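The cost difference between a common and a rare first element can be seen in a toy model. The Python sketch below is an assumption-laden illustration rather than a description of the engine: it assumes candidate nodes are looked up by element type through an index, and the index, element_types and match names are hypothetical.

```python
from collections import defaultdict

tokens = "Wow ! That was fast , really fast !".split()

def element_types(tok):
    """Element types carried by a token in this toy model."""
    types = ["Token"]
    if tok == "!":
        types.append("Token.punctuation.exclamation")
    return types

# Index from element type to the token positions that carry it.
index = defaultdict(list)
for pos, tok in enumerate(tokens):
    for t in element_types(tok):
        index[t].append(pos)

def match(first_element, text_test=None):
    """Return (matching positions, number of candidate nodes examined)."""
    candidates = index[first_element]
    hits = [p for p in candidates
            if text_test is None or text_test(tokens[p])]
    return hits, len(candidates)

# Slow style: start from Token and test the text of every token.
slow_hits, slow_tested = match("Token", lambda s: s == "!")
# Fast style: start from the rare, specific element.
fast_hits, fast_tested = match("Token.punctuation.exclamation")
print(slow_hits, slow_tested, fast_hits, fast_tested)
```

Both styles find the same two exclamation marks, but the first examines all nine tokens while the second examines only the two nodes that carry the rare element.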