There are four main steps to building an effective rule set. To view the workflow click here.
This section covers the third and fourth steps in the workflow: refining the gold standard and adding rules.
For a gold standard to be most effective, each document within it must contain only the elements you want the rule set to be able to select. However, when a gold standard document is first created, it may contain some boilerplate elements (elements on websites other than the content, such as navigation bars, side bars, footers, menus and advertisements) and/or content you do not want, depending on the effectiveness of the rule set used to harvest the web page. In ETA Harvester, content is the text you want to harvest from a web page, such as headings, authors, dates, captions and paragraphs (as opposed to the text you want to ignore from menus, sidebars and other boilerplate elements). You need to refine the gold standard, which means removing all the elements you do not want the rule set to be able to select and adding the elements that you want it to be able to select.
This is a manual process, which is simplified by being able to view the gold standard document (which contains only text) beside the corresponding full page document (which contains all the elements from the original web page). A full page document is a document in an ETA Harvester gold standard collection that contains every harvestable element (content and boilerplate) from a web page. As you select elements on the full page document to add to the gold standard (or remove from it), you immediately see the effect on the gold standard. When you are satisfied with the content of the gold standard, you can create rules to select those elements.
Elements are colour coded to indicate their status:
The Documents table in the left pane shows the F1 score and the number of correct, spurious and missed elements, both for the currently selected document and across the whole gold standard collection.
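The following Python sketch illustrates how counts like these relate to an F1 score. It assumes the conventional definitions: a correct element is in the gold standard and selected by a rule, a missed element is in the gold standard but not selected, and a spurious element is selected but not in the gold standard. The function names and element paths are hypothetical, and ETA Harvester's own calculation may differ in detail.

```python
def classify(gold: set[str], selected: set[str]) -> dict[str, set[str]]:
    """Split element paths into correct, missed and spurious groups."""
    return {
        "correct": gold & selected,    # in the gold standard and selected by a rule
        "missed": gold - selected,     # in the gold standard, not selected by any rule
        "spurious": selected - gold,   # selected by a rule, not in the gold standard
    }


def f1_score(correct: int, spurious: int, missed: int) -> float:
    """Conventional F1: harmonic mean of precision and recall."""
    if correct == 0:
        return 0.0  # also avoids division by zero
    precision = correct / (correct + spurious)
    recall = correct / (correct + missed)
    return 2 * precision * recall / (precision + recall)


# Hypothetical element paths, for illustration only.
gold = {
    "/html/body/article/h1",
    "/html/body/article/p[1]",
    "/html/body/article/p[2]",
}
selected = {
    "/html/body/article/h1",
    "/html/body/article/p[1]",
    "/html/body/nav/ul",
}

status = classify(gold, selected)
score = f1_score(
    correct=len(status["correct"]),
    spurious=len(status["spurious"]),
    missed=len(status["missed"]),
)
print(f"correct={len(status['correct'])} "
      f"spurious={len(status['spurious'])} "
      f"missed={len(status['missed'])} F1={score:.3f}")
```

In this example, two elements are correct, one is missed and one is spurious, so precision and recall are both 2/3. Adding a rule that selects the missed paragraph, or removing the rule that selects the navigation list, would raise the score.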
To refine the gold standard and add rules:
The rule set panes are displayed.
Note: To increase the space available for these panes you can collapse the Configurations and Harvester Rule Sets panes. Click the left arrow button at the top of a pane.
The full page document is displayed in the centre pane.
If the rule set used to harvest the web page was effective, the gold standard will already contain a significant amount of text. In the full page document, the elements that are already in the gold standard but have not been selected by a rule are highlighted in pink. The number of these elements is shown in the Documents table in the Missed column.
A popup shows the element’s path and whether or not the element has been selected by a rule and is in the gold standard.
Note: If a gold standard contains many unwanted child elements that share a parent element, add the parent element to the gold standard and then remove it. This removes all of its child elements in one step.
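The effect of removing a parent element can be pictured with a short Python sketch. It models the gold standard as a set of element paths and assumes that removing a parent also removes every element whose path lies beneath it. The function `remove_subtree` and the paths are hypothetical illustrations, not part of ETA Harvester.

```python
def remove_subtree(gold: set[str], parent: str) -> set[str]:
    """Return the gold standard without the parent or any of its descendants."""
    prefix = parent.rstrip("/") + "/"
    return {
        path for path in gold
        if path != parent and not path.startswith(prefix)
    }


# Hypothetical gold standard containing unwanted sidebar elements.
gold = {
    "/html/body/article/h1",
    "/html/body/article/p[1]",
    "/html/body/aside/h2",
    "/html/body/aside/ul/li[1]",
    "/html/body/aside/ul/li[2]",
}

# Add the shared parent, then remove it together with everything beneath it.
gold.add("/html/body/aside")
refined = remove_subtree(gold, "/html/body/aside")
print(sorted(refined))
```

Matching on the parent path followed by a slash, rather than on the bare path, keeps sibling elements with similar names (for example a path ending in `aside2`) out of the removal.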