Harvesting a page

For an overview of page harvesting, see Page harvesting.

To view the page harvesting workflow click here.

To harvest a page:

  1. Using Google Chrome or Chromium, go to the web page you want to harvest.
  2. Note: Each time you open Chrome, you are prompted to disable developer mode extensions. Click Cancel. If you click Disable, ETA Harvester is disabled. This message is not displayed in Chromium or on Linux.

  3. Click the Harvester icon at the top right of the Google Chrome window, or press Alt+C.
  4. The rule set that best matches the URL is automatically applied to the web page. The elements selected by the rule set are highlighted in green.

    Text shaded in black (such as superscript references to footnotes and hyperlinks to edit text), or surrounded by a black border (such as sidebars and tables of contents), will be excluded from the harvest.

    The black border indicates that even though text within the border has been selected for harvesting (green), the rule set states that the block containing the text is to be excluded from the harvest.

    Due to the varying nature of web pages and the effectiveness of individual rule sets, you may need to manually select and deselect elements to ensure that you harvest exactly what you want.

    If you are not satisfied with the selection made by the rule set you can:

    Note: If you are harvesting a page to create or add to a gold standard and you cannot select elements as precisely as you need, you can edit your selections later, when you create rules in the gold standard collection.

  5. Do one or more of the following:
  6. When you are satisfied with the text selected for harvesting, do one of the following:
  7. Do one of the following:
  8. A message confirms that the document has been sent to the ETA collection.

  9. To close the dialog, click Close.
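
The relationship between green (selected) and black (excluded) elements described in the procedure above can be illustrated with a small sketch. This is not ETA Harvester's code or data model: the Element class, its selected and excluded fields, and the harvest function are hypothetical names introduced only to show the rule that selected text is dropped when it sits inside an excluded block.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Hypothetical stand-in for a page element; not the Harvester's data model."""
    text: str = ""
    selected: bool = False   # assumed to mean "highlighted in green"
    excluded: bool = False   # assumed to mean "shaded black or black border"
    children: List["Element"] = field(default_factory=list)


def harvest(element: Element, inside_excluded: bool = False) -> List[str]:
    """Collect text that is selected and not inside any excluded block."""
    blocked = inside_excluded or element.excluded
    collected = []
    if element.selected and not blocked and element.text:
        collected.append(element.text)
    for child in element.children:
        # Exclusion is inherited: children of an excluded block stay excluded.
        collected.extend(harvest(child, blocked))
    return collected


page = Element(children=[
    Element(text="Main article paragraph.", selected=True),
    Element(text="[1]", selected=True, excluded=True),      # footnote marker
    Element(excluded=True, children=[                         # sidebar block
        Element(text="Sidebar link list.", selected=True),
    ]),
    Element(text="Unselected footer."),
])

print(harvest(page))
```

In this sketch only the main paragraph is returned: the footnote marker is excluded directly, the sidebar text is selected but lies inside an excluded block, and the footer was never selected.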