Running a batch harvest

For an overview of batch harvesting see Batch harvesting.

To view the batch harvesting workflow click here.

To run a batch harvest:

  1. Log in to ETA.
  2. Note: Each time you open Chrome you are prompted to disable developer mode extensions. Click Cancel. If you click Disable, ETA Harvester will be disabled. This message is not displayed in Google Chromium or on Linux.

  3. Select the project in which you want to save the batch harvested documents, or create a new one.
  4. The project dashboard is displayed.

  5. On the Main Navigation Bar click Harvester.
  6. Do one of the following:
  7. In the 'URLs or Search Terms' field enter the list of URLs you want to harvest and/or terms you want to search for using Google.
  8. Note: Each URL must be on its own line. You can copy the list from a text file.

  9. Click 'Add to Harvest Queue'.
  10. The URLs and/or search terms are moved to the Harvest Queue.

    Duplicates are automatically deleted and a message is displayed above the Harvest Queue to indicate this.

    The Rule Set column indicates the rule set that will be used for each URL and search term.

  11. If you want to:
  12. Harvested content is saved in documents (one per web page) in the ETA collection you selected. In ETA Harvester, content is the text you want to harvest from a web page, such as headings, authors, dates, captions and paragraphs, as opposed to the text you want to ignore from menus, sidebars and other boilerplate elements. If you also want to harvest full pages (that is, content and boilerplate elements), click Batch Parameters then tick ‘Harvest full page (content and boilerplate elements)’. Boilerplate elements are elements on websites other than the content, such as navigation bars, side bars, footers, menus and advertisements. Each full page will be saved in a separate document in the same collection.
  13. For more information see Rule sets.

  14. To mimic patterns of human interaction with websites, ETA Harvester can go to the websites in the Harvest Queue then wait a random amount of time (up to 60 seconds) before it begins harvesting text. To enable this setting, click Batch Parameters then tick ‘Wait random time before harvesting’.
  15. Note: There are two ‘wait’ settings related to rule sets: ‘random wait’ (described above), which can be applied to all URLs in the Harvest Queue, and ‘rule set wait’, where you can configure individual rule sets to wait a specified amount of time (up to 60 seconds) before harvesting to enable pages to load completely (see ‘Wait Before Harvest’ in Fields and options on the Rule Set Configuration tab).

  16. To repeat the harvest at a regular interval (for example, every minute or every 24 hours), click Batch Parameters then tick ‘Repeat harvest every’. Select the interval from the dropdown list. To stop repeat harvests, deselect the checkbox.
  17. The ‘Max Harvest Depth’ setting in a rule set specifies how many levels of hyperlinks Harvester will follow from the main URL. Do one or more of the following:
  18. Do one of the following:
  19. Note: If you disable a rule set that affects one or more URLs in the Harvest Queue, they are highlighted in red.

  20. To clear the Harvest Queue and all settings, right-click anywhere on the screen then click Reload.
  21. To begin the batch harvest click Batch Harvest.
  22. Sites currently being harvested are shown in individual panes on the right. Up to four pages are harvested simultaneously.

    One document is created per web page (unless you ticked ‘Harvest full page’, in which case a second document is created per web page). All of the documents are saved in the ETA collection you selected.

    If the ‘Max Harvest Depth’ of the rule set is greater than one, any follow-on hyperlinks found on loaded pages are added to the Harvest Queue, and these pages are then harvested.

    The Status column indicates the progress of each harvest. For a description of each status click here.

  23. If necessary, do one or more of the following:
  24. When the batch harvest has been completed you can:

 

 

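Preparing the list for the 'URLs or Search Terms' field: if you keep your URLs and search terms in a text file, it can help to tidy the list before you paste it in. The sketch below is not part of ETA Harvester. It is a minimal Python example that puts one entry on each line, drops blank lines and removes repeated entries. The function name prepare_harvest_input and the sample URLs are hypothetical, and the sketch assumes that two entries are duplicates only when they match exactly after surrounding spaces are trimmed. ETA Harvester removes duplicates from the Harvest Queue itself and may compare entries differently.

```python
def prepare_harvest_input(raw_lines):
    """Return unique, non-empty entries in their original order.

    Hypothetical helper: it only tidies text before it is pasted into the
    'URLs or Search Terms' field. It does not call ETA Harvester.
    """
    seen = set()
    entries = []
    for line in raw_lines:
        entry = line.strip()  # ignore leading and trailing spaces
        if not entry or entry in seen:
            continue  # skip blank lines and exact repeats
        seen.add(entry)
        entries.append(entry)
    return entries


# Sample input (made-up URLs and a search term, for illustration only).
sample = [
    "https://example.com/news",
    "",
    "  https://example.com/blog  ",
    "https://example.com/news",
    "renewable energy policy",
]

# One entry per line, ready to copy into the field.
print("\n".join(prepare_harvest_input(sample)))
```

To use it with a file, read the file's lines into a list, pass the list to the function, then copy the printed output into the field.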