Dictionaries provide a simple, highly productive method for creating text references in documents, and more generally creating links on text graphs.
Entity Extraction Scripts (EESs) run much faster when Dictionaries are used to create the initial text graph literals. This avoids generic matching pattern elements such as Token, as in Token<string=XXX>, which match very frequently and therefore make EES rules slow.
At their most basic, Dictionaries comprise Word Lists, which are lists of words and phrases that ETA can then recognise in documents. Each word or phrase in a word list is called an entry.
All white space is considered equal except for line feeds, which are used to separate command phrases and word list entries. Blank lines can be added for readability without changing the meaning of a Dictionary.
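The white space rule above can be illustrated with a short Python sketch. This is not ETA's implementation, only an assumed reading of the rule: line feeds separate entries, any other run of white space counts as a single space, and blank lines are ignored. The function name split_entries is hypothetical.

```python
def split_entries(dictionary_text: str) -> list[str]:
    """Split Dictionary text on line feeds and collapse other white space.

    Illustrative only: assumes that tabs, repeated spaces and carriage
    returns are all equivalent to a single space, and that blank lines
    carry no meaning.
    """
    entries = []
    for line in dictionary_text.split("\n"):
        # str.split() with no argument splits on any run of white space.
        normalized = " ".join(line.split())
        if normalized:
            entries.append(normalized)
    return entries


sample = "chief   executive\tofficer\n\n  secretary  \n"
print(split_entries(sample))  # ['chief executive officer', 'secretary']
```

Under this reading, the two spellings "chief   executive officer" and "chief executive officer" denote the same entry.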
Comments use the C++ / Java style:
// this is a comment until end of line
/* this is a comment
which spans multiple lines */
The core commands available in Dictionaries include:
Defining word lists
#cols column1, column2, ... // where "column1" etc. are column definitions
#attribute:typevalue // where "attribute" can be "feature" or "cond" or "generalize"
(Feature type and value are variables.)
Additional conditions that need to be met before the snippet is labelled
#cond:type // where "type" can be "case" or "context"
Generalizing singular words and phrases to detect plural versions
#generalize:type[true|false] // where "type" takes the value "plural"
Here are some simple examples of using a Dictionary to mark up words and phrases (see the ETA Dictionary demonstration "Demo 1. Basics"):
The Chief Executive Officer John Smith asks his secretary to send a letter.
// starts a Word List with #wordlist "Demo1-jobtitle"
// add entries by writing each as an ordinary line
chief executive officer
// you can add blank lines or comments for readability
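Conceptually, a word list entry marks every occurrence of its phrase in the document text. The Python sketch below imitates that behaviour for the example sentence. It is an illustration only, not how ETA works internally: the function find_entries is hypothetical, matching is assumed to be case-insensitive unless a condition says otherwise, and the word list name is used as the label.

```python
import re


def find_entries(text: str, entries: list[str], label: str) -> list[tuple[int, int, str, str]]:
    """Return (start, end, matched_text, label) for each entry occurrence.

    Assumptions (not confirmed by the ETA documentation): matching ignores
    case, any run of white space may separate the words of an entry, and a
    match must not begin or end in the middle of a word.
    """
    matches = []
    for entry in entries:
        words = [re.escape(word) for word in entry.split()]
        pattern = r"(?<!\w)" + r"\s+".join(words) + r"(?!\w)"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((m.start(), m.end(), m.group(0), label))
    return sorted(matches)


text = "The Chief Executive Officer John Smith asks his secretary to send a letter."
for match in find_entries(text, ["chief executive officer"], "Demo1-jobtitle"):
    print(match)  # (4, 27, 'Chief Executive Officer', 'Demo1-jobtitle')
```

The character offsets play the role of the text reference: they identify the span of the document that the entry matched.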
The Text Graph Analyzer is included as part of the Dictionary development page in ETA. In this case it provides the following diagnostic information:
To create a visible text reference, the link is created in the name space "tag" with the output phrase.
The victim was rushed to Hospital.
Police secured the area, searched for weapons and other evidence. The
suspected assailant was apprehended carrying a
recently discharged pistol.
This generates the following output in the document view:
Certain characters and character sequences have special meaning within the text of Word Lists:
# , * // /* */
They need to be "escaped" when they are used in the text of a word list entry. In ETA Dictionaries, the text to be escaped is surrounded with double quotes (").
Within escaped text, the back slash (\) is used to escape the back slash and double quote characters.
#BP, one, zero, x * y, a // b, double " quote, back \ slash
number of = #
He said, "The wild card is *"
// note that [double ["] quote] contains nested matches
hello world // phrase is ok
"#BP" // escaping hash (#)
"one, zero" // escaping comma (,)
"x * y" // escaping asterisk (*)
it's // apos (') is ok
x-mas // dash (-) is ok
N\A // back slash (\) is ok
single slash / is ok
"a // b" // escaping (//)
"a /* b" // escaping (/*)
"a */ b" // escaping (*/)
number of = "#"
// Escaping " and \ within double quotes
"double \" quote" // escaping (") becomes (double " quote)
"\"" // escaping (") by itself becomes (")
"back \\ slash" // escaping (\) becomes (back \ slash)
"back \\ slash" // escaping (\) becomes (back \ slash)
He said", \"The wild card is \*\""
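The quoting rules can be made concrete with a small Python sketch that turns one word list line into the literal text it stands for. It is an assumed reading of the rules described above, not ETA's parser: the function parse_entry is hypothetical, and it handles only double-quoted text, the \" and \\ escapes inside quotes, and a trailing // comment. Commands (#), columns (,), wild cards (*) and /* */ comments are deliberately left out.

```python
def parse_entry(line: str) -> str:
    """Return the literal text of one word list line.

    Illustrative assumptions: text inside double quotes is taken literally,
    a back slash inside quotes escapes only a double quote or another back
    slash, and an unquoted // starts a comment that runs to end of line.
    """
    out = []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == "\\" and i + 1 < len(line) and line[i + 1] in '"\\':
                out.append(line[i + 1])  # keep the escaped character only
                i += 2
                continue
            if ch == '"':
                in_quotes = False  # closing quote ends the escaped text
            else:
                out.append(ch)
        else:
            if line.startswith("//", i):
                break  # rest of the line is a comment
            if ch == '"':
                in_quotes = True
            else:
                out.append(ch)
        i += 1
    # All white space is treated as equal, so collapse runs and trim.
    return " ".join("".join(out).split())


for raw in [r'"#BP" // escaping hash (#)',
            r'"double \" quote" // escaping (")',
            r'"back \\ slash" // escaping (\)',
            r'N\A // back slash (\) is ok']:
    print(parse_entry(raw))
```

Raw string literals (r'...') are used so that the back slashes reach the function exactly as they would appear in the Dictionary text.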
Dictionaries can be inserted into the ETA workflow so that they can be used to process documents in bulk.
Dictionaries are inserted into the ETA workflow via the Document Processing Configuration, which can be accessed from the Configuration menu or directly from the Text Graph Analyzer. The image below shows the document processing configuration needed to run three demonstration Dictionaries (Demos 1, 2 and 4).
Dictionaries are run over documents during processing. They are listed in the table in the Document Processing configuration, illustrated below.