Planning your information extraction project

Obtaining the most value from ETA involves using and configuring ETA’s information extraction capabilities. These capabilities are:

Our recommended approach to running information extraction projects is outlined below. If your project is small you can adopt a more informal approach, but we recommend that at the very least you review the output requirements and data thoroughly.

Before you begin

Some points to consider before you begin:

The recommended approach

Step 1.

Determine, as precisely as possible, the information you need from the project documents.

  • What items of information are needed (for example, entities, document numbers and titles, and citations)?
  • Is the information in tables or is it embedded in free text?
  • Is it necessary to recognise all references to these items in each document (as in entity network creation) or is it just necessary to extract the value of each item (for example, the document’s title)?
  • What is known about the information to be extracted? For example:
    • Does it have a sequential pattern?
    • Are there constraints on the terms used to describe it? For example, its value may be a number or it may be written in upper case.
    • Are there contextual indicators as to where it might be found? For example, ‘D.O.B’ or ‘date-of-birth’.
  • What are acceptable error and miss rates (which may depend on specific items)?
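
As an illustration of how a contextual indicator and a value constraint can work together, here is a minimal sketch in plain Python. It is not ETA configuration syntax; the pattern, the function name and the dd/mm/yyyy date format are assumptions made for the example.

```python
import re

# Hypothetical pattern, not part of ETA: a contextual indicator
# ("D.O.B" or "date-of-birth") followed by a constrained value
# (a numeric date written as dd/mm/yyyy).
DOB_PATTERN = re.compile(
    r"(?:D\.O\.B\.?|date[- ]of[- ]birth)"   # contextual indicator
    r"\s*[:\-]?\s*"                          # optional separator
    r"(\d{1,2}/\d{1,2}/\d{4})",              # constrained value
    re.IGNORECASE,
)


def find_dates_of_birth(text):
    """Return every date value that directly follows a date-of-birth indicator."""
    return DOB_PATTERN.findall(text)


sample = "Name: J. Smith  D.O.B: 04/07/1982. Issued 01/02/2020."
print(find_dates_of_birth(sample))
```

Only the date that follows the indicator is returned; the issue date in the same sentence is ignored because nothing marks it as a date of birth.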

Step 2.

Review the documents to understand where the information can be found and how it could be extracted.

Step 3.

Revise your information needs, now that you have reviewed the data.

Step 4.

Identify additional sources of the information, or of parts of it, that may be easier to work with, such as existing lists of items of interest.

Step 5.

Decide whether ETA can extract the information you need ‘out of the box’ or needs to be configured.

Step 6.

Create a plan that answers these questions:

  • What information is to be extracted?
  • How is it to be identified? Can it be identified more simply, for example, using dictionaries rather than entity extraction scripts?
  • In what form is the information required? For example, to support search or network creation or to generate an output listing.
  • How is the extraction to be verified to confirm that it has acceptable error and miss rates?
  • Will the project be conducted in stages? For example:
    • With sections of the information extracted in each stage?
    • With document types handled separately?

Step 7.

Validate the plan by running a trial on a modest data set.
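
If the trial data set is drawn from the full document collection, a reproducible random sample helps to avoid testing only the easiest documents. The following is a minimal sketch in plain Python, assuming you hold a list of document identifiers; the function name and the identifiers are hypothetical.

```python
import random


def pick_trial_sample(doc_ids, sample_size, seed=0):
    """Return a reproducible random subset of document identifiers."""
    rng = random.Random(seed)            # fixed seed so the trial can be repeated
    size = min(sample_size, len(doc_ids))
    return sorted(rng.sample(list(doc_ids), size))


# Hypothetical identifiers for a collection of 500 documents.
all_docs = [f"doc-{n:04d}" for n in range(1, 501)]
trial_docs = pick_trial_sample(all_docs, 25)
print(len(trial_docs), "documents selected for the trial")
```

Keeping the seed fixed means that each iteration of the plan is tested against the same documents, so changes in the results reflect changes in the approach rather than in the sample.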

Step 8.

Iterate until you have a satisfactory plan.

Step 9.

Use the guidelines in the topic Improving your results to complete your project.

 
