Refining the gold standard and adding rules

There are four main steps to building an effective rule set. To view the workflow click here.

This section covers the third and fourth steps in the workflow: refining the gold standard and adding rules.

For a gold standard to be most effective, each document within it must contain only the elements you want the rule set to be able to select. However, when a gold standard document is first created, it may contain some boilerplate elements (elements on websites other than the content, such as navigation bars, side bars, footers, menus and advertisements) and/or content you do not want, depending on the effectiveness of the rule set used to harvest the web page. In ETA Harvester, content is the text you want to harvest from a web page, such as headings, authors, dates, captions and paragraphs, as opposed to the text you want to ignore from menus, sidebars and other boilerplate elements. You need to refine the gold standard document, which means removing all the elements you do not want the rule set to be able to select and adding the elements that you do want it to be able to select.

This is a manual process, made simpler by viewing the gold standard document (which contains only text) beside the corresponding full page document (which contains all the elements from the original web page). A full page document is a document in an ETA Harvester gold standard collection that contains every harvestable element (content and boilerplate) from a web page. As you select elements on the full page document to add to the gold standard (or remove from it), you immediately see the effect on the gold standard. When you are satisfied with the content of the gold standard, you can create rules to select those elements.

Elements are colour coded to indicate their status:

Figure 1: A full page document and the corresponding gold standard document

The Document table in the left pane shows the F1 score and the number of correct, spurious and missed elements, both for the currently selected document and across the whole gold standard collection.

Figure 2: The Document table
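
The relationship between the correct, spurious and missed counts and the F1 score can be sketched as follows. This is only an illustration, assuming ETA uses the standard F1 definition (the harmonic mean of precision and recall); the element paths, the ElementCounts class and the function names are hypothetical and are not part of ETA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementCounts:
    correct: int   # in the gold standard and selected by a rule
    spurious: int  # selected by a rule but not in the gold standard
    missed: int    # in the gold standard but not selected by any rule


def classify(gold: set[str], selected: set[str]) -> ElementCounts:
    """Compare gold standard elements with rule-selected elements."""
    return ElementCounts(
        correct=len(gold & selected),
        spurious=len(selected - gold),
        missed=len(gold - selected),
    )


def f1_score(counts: ElementCounts) -> float:
    """Standard F1 (assumed here): harmonic mean of precision and recall."""
    if counts.correct == 0:
        return 0.0
    precision = counts.correct / (counts.correct + counts.spurious)
    recall = counts.correct / (counts.correct + counts.missed)
    return 2 * precision * recall / (precision + recall)


# Hypothetical element paths for a single document.
gold = {"/html/body/article/h1", "/html/body/article/p[1]", "/html/body/article/p[2]"}
selected = {"/html/body/article/h1", "/html/body/article/p[1]", "/html/body/nav/a[1]"}

counts = classify(gold, selected)
print(counts, round(f1_score(counts), 3))
```

In this example two elements are correct, one is spurious and one is missed, so precision and recall are both 2/3 and the F1 score is about 0.667. Under this definition, adding a rule that selects the missed paragraph, or narrowing the rule that selects the navigation link, would raise the score.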

To refine the gold standard and add rules:

  1. In ETA, open the project that contains the gold standard you want to refine.
  2. On the Main Navigation Bar click Configurations.
  3. Click Harvester Rule Sets.
  4. Click on the rule set associated with the gold standard.
  5. The rule set panes are displayed.

    Note: To increase the space available for these panes you can collapse the Configurations and Harvester Rule Sets panes. Click the left arrow button at the top of a pane.

  6. In the Document list, click on one of the documents.
  7. The full page document is displayed in the centre pane.

  8. In the right pane click Gold Standard.
  9. If the rule set used to harvest the web page was effective, the gold standard will already contain a significant amount of text. In the full page document, the elements that are already in the gold standard but have not been selected by a rule are highlighted in pink. The number of these elements is shown in the Missed column of the Document table.

  10. Hover over an element (the element is highlighted in yellow) then click on it.
  11. A popup shows the element’s path and whether or not the element has been selected by a rule and is in the gold standard.

  12. Do one of the following:
  13. Note: If the gold standard contains many unwanted child elements that share a parent element, add the parent element to the gold standard, then remove it. This quickly removes all the child elements.

  14. To add a rule to harvest this element from similar web pages:
  15. To see which elements have been selected by the rule, click the checkbox beside the name of the rule. The elements are highlighted with the same colour as the rule.
  16. To edit the rule, click on the rule, make the changes you require then click Save. For a description of each field and option click here.
  17. To apply the rule to the other documents in the collection, click Update Now under the Document table.
  18. Continue adding rules until you are satisfied with the F1 score and number of spurious and missed elements.
  19. To configure batch harvest parameters click the Rule Set Configuration tab, enter the values you require then click Save. For a description of each parameter click here.
  20. Test the rule set for effectiveness. Run a small batch harvest then assess the results. Repeat this until you are satisfied with the results.