Harvesting web pages for a gold standard

There are four main steps to building an effective rule set.

This section covers the second step in the workflow: harvesting web pages to create a gold standard. A gold standard is a set of model data that you can learn from and test on. For example, in ETA, this would be a collection of documents that have been created with specific, preferred properties, such as correct document tags and text references. In ETA Harvester, this would be a collection of documents harvested from web pages where only the correct elements have been selected (that is, only the content you want).

To harvest web pages for a gold standard:

  1. Find several web pages that are representative of the content you ultimately want to harvest using the rule set. In ETA Harvester, content is the text you want to harvest from a web page, such as headings, authors, dates, captions and paragraphs (as opposed to the text you want to ignore from menus, sidebars and other boilerplate elements).
  2. Do one of the following: