Rule sets

A rule set is a group of rules designed to select the elements on a web page that are most likely to contain useful content (such as headings, authors, dates, captions and paragraphs) and not select boilerplate elements. In ETA Harvester, content is the text you want to harvest from a web page, as opposed to the text you want to ignore from menus, sidebars and other boilerplate elements. Boilerplate elements are the elements on websites other than the content, such as navigation bars, side bars, footers, menus and advertisements.

New ETA projects automatically contain a number of pre-defined rule sets for harvesting text from news sites, wikis, forums and Google searches, and from specific domains such as Twitter, Facebook and LinkedIn. The rule sets are stored in a configuration titled ‘Harvester Rule Sets’. You can modify these rule sets, delete them and/or create your own.

Note: To create or modify a rule set you need a solid understanding of HTML, including nested elements and classes.
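To make the idea of selecting content and skipping boilerplate more concrete, here is a minimal Python sketch. It is not ETA's rule syntax, which this section does not describe. The tag and class lists, the ContentSelector class and the sample page are all hypothetical, chosen only to show how nested elements and classes decide what gets selected.

```python
from html.parser import HTMLParser

# Hypothetical "rules" (not ETA's format): what to select and what to skip.
CONTENT_TAGS = {"h1", "p"}
CONTENT_CLASSES = {"author"}
BOILERPLATE_TAGS = {"nav", "footer", "aside"}


class ContentSelector(HTMLParser):
    """Collects text from content elements that are not nested in boilerplate."""

    def __init__(self):
        super().__init__()
        self.boilerplate_depth = 0  # > 0 while inside a boilerplate element
        self.capture_tag = None     # tag of the element currently being captured
        self.buffer = []
        self.selected = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.boilerplate_depth += 1
            return
        if self.boilerplate_depth or self.capture_tag:
            return  # nested inside boilerplate or inside an element already captured
        classes = (dict(attrs).get("class") or "").split()
        if tag in CONTENT_TAGS or any(c in CONTENT_CLASSES for c in classes):
            self.capture_tag = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.boilerplate_depth:
            self.boilerplate_depth -= 1
        elif tag == self.capture_tag and not self.boilerplate_depth:
            text = " ".join("".join(self.buffer).split())
            if text:
                self.selected.append(text)
            self.capture_tag = None

    def handle_data(self, data):
        if self.capture_tag and not self.boilerplate_depth:
            self.buffer.append(data)


# Made-up sample page: the <p> elements inside <nav> and <footer> are skipped.
PAGE = """
<nav><p>Home | About</p></nav>
<h1>Budget passes</h1>
<span class="author">J. Smith</span>
<p>The budget <b>passed</b> today.</p>
<footer><p>Copyright</p></footer>
"""

selector = ContentSelector()
selector.feed(PAGE)
print(selector.selected)
```

The same p tag is selected in the article body and ignored inside nav and footer, which is why an understanding of nesting matters when you write rules.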

ETA Harvester automatically applies the most relevant rule set to each web page you want to harvest. In page harvesting, if you are not satisfied with the elements selected by the rule set you can manually select and deselect elements or apply another rule set.

In batch harvesting you cannot choose the rule sets that are used. If you are not satisfied with the results (for example, too many boilerplate elements are being selected or content you want to harvest is being missed) you can customise the pre-defined rule sets or create your own.

When you create a rule set, ETA automatically creates a collection with the same name and the suffix ‘GS’, to indicate that it is a gold standard collection. A collection is a container for storing and organising ingested files and documents; only the textual content is stored in collections, not the original files and documents. The next step is to create a gold standard and add rules to harvest the content in it.

To view the workflow for creating rule sets click here.

Gold standards

A gold standard is a set of model data that you can learn from and test on. In ETA Harvester, a gold standard is a collection of documents that contain text harvested from web pages for the specific purpose of creating a rule set. A gold standard should contain only the elements you want to harvest from the web pages, and nothing else.

The web pages you harvest must be representative of the content you want to harvest with the rule set (for example, an article from a specific news site or a social media profile from a specific site).

The number of web pages you need to harvest to create a gold standard collection will vary according to the uniformity of the HTML elements on the pages. For example, if you are creating a rule set to harvest content from a single news site and there is very little variation in the elements that are used from one article to the next (such as a heading, subheading, author, date and paragraphs), you may only need to harvest three or four pages. For sites with greater variations, or to create a rule set that can be applied to multiple sites, you may need to harvest many more pages to gather an adequate sample.

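You need two versions of each web page: the version harvested with a rule set, which becomes the gold standard document, and the corresponding full page document.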

You harvest both versions simultaneously by ticking the ‘Harvest full page’ option in batch harvesting or the ‘Full page’ option in page harvesting.

Depending on the effectiveness of the rule set used to harvest the documents, the gold standard documents may contain some boilerplate elements, and elements that you wanted to harvest may have been missed. By comparing each gold standard document with its corresponding full page document, you can easily see and remove the boilerplate elements in the gold standard and add any elements that were missed. The goal is to refine the gold standard documents to the point where each is a perfect, or near-perfect, example of all the text you want to harvest from the corresponding web page.

Figure 1 shows the Rule Sets panes. In the left pane, a table displays the F1 score and a summary of the number of elements in the gold standard that are correct, spurious or missed. The F1 score indicates the precision with which the rule set is selecting the text you want and the level of recall it is achieving (that is, whether it’s missing a few or many elements). An F1 score of 1 indicates perfect precision and recall.

Until you create some rules, all the elements are listed in the Missed column.
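As a rough illustration of how an F1 score relates to the correct, spurious and missed counts, the sketch below uses the standard definitions of precision and recall. It assumes ETA computes the score in the conventional way, which this section does not confirm. The function name f1_score is hypothetical.

```python
def f1_score(correct: int, spurious: int, missed: int) -> float:
    """Standard F1 from element counts (assumed, not confirmed, to match ETA)."""
    if correct == 0:
        # Nothing selected correctly, e.g. before any rules exist
        # and every gold standard element is counted as missed.
        return 0.0
    precision = correct / (correct + spurious)  # share of selected elements that are wanted
    recall = correct / (correct + missed)       # share of wanted elements that were selected
    return 2 * precision * recall / (precision + recall)


print(f1_score(correct=10, spurious=0, missed=0))   # perfect precision and recall
print(f1_score(correct=0, spurious=0, missed=12))   # no rules yet: everything missed
print(round(f1_score(correct=8, spurious=2, missed=2), 3))
```

Under these definitions, spurious elements lower precision and missed elements lower recall, so the score only reaches 1 when both counts are zero.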

Clicking on a document displays the full document that was harvested, with correct, spurious and missed elements highlighted. The right pane contains four tabs. In Figure 1 the Gold Standard tab has been selected.

Figure 1: The Rule Sets configuration

Elements are colour coded to indicate their status:

Figure 2: Correct elements are those in the gold standard and selected by the rule set
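The three statuses can be thought of as set comparisons between the gold standard and the rule set's selection. In the sketch below, the definition of correct follows the Figure 2 caption. The definitions of spurious (selected but not in the gold standard) and missed (in the gold standard but not selected) are assumed from the usual meaning of those terms. The function name and the element identifiers are made up for illustration.

```python
def classify_elements(gold: set, selected: set) -> dict:
    """Split elements into correct, spurious and missed using set operations."""
    return {
        "correct": gold & selected,    # in the gold standard and selected by the rule set
        "spurious": selected - gold,   # assumed: selected but not in the gold standard
        "missed": gold - selected,     # assumed: in the gold standard but not selected
    }


# Hypothetical element identifiers for one harvested page.
gold_standard = {"h1:headline", "span:author", "p:1", "p:2"}
rule_set_selection = {"h1:headline", "p:1", "p:2", "nav:menu"}

status = classify_elements(gold_standard, rule_set_selection)
for name in ("correct", "spurious", "missed"):
    print(name, sorted(status[name]))
```

With an empty selection every gold standard element falls into the missed set, which matches the behaviour described for a rule set that has no rules yet.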

The ideal gold standard document has an F1 score of 1 and no spurious or missed elements. The following tasks can help you get as close as possible to this:

These tasks are covered in Creating a rule set.

How do ETA upgrades affect rule sets?

When you upgrade ETA:

 

 
