Running a batch harvest
For an overview of batch harvesting see Batch harvesting.
To view the batch harvesting workflow .
To run a batch harvest:
- Log in to ETA.
Note: Each time you open Chrome you will be prompted to disable developer mode extensions. Click Cancel. If you click Disable, ETA Harvester will be disabled. This message is not displayed in Google Chromium or on Linux.
- Select the project in which you want to save the batch harvested documents, or create a new one.
The project dashboard is displayed.
- On the Main Navigation Bar click Harvester.
- Do one of the following:
- In the 'URLs or Search Terms' field, enter the list of URLs you want to harvest and/or the terms you want to search for using Google.
Note: Each URL must be on a separate line. You can copy the list from a text file.
- Click 'Add to Harvest Queue'.
The URLs and/or search terms are moved to the Harvest Queue.
Duplicates are automatically deleted and a message is displayed above the Harvest Queue to indicate this.
The Rule Set column indicates the rule set that will be used for each URL and search term.
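The queueing behaviour described above (one entry per line, duplicates dropped and reported) can be illustrated with a short Python sketch. This is not ETA Harvester's own code: the function name add_to_queue is hypothetical, and the exact-match comparison of entries is an assumption, since the way ETA decides that two entries are duplicates is not documented here.

```python
def add_to_queue(queue, raw_text):
    """Add one entry per non-empty line of raw_text to queue.

    Entries already in the queue (or repeated within raw_text) are skipped.
    Returns the list of skipped duplicates so a message can be shown.
    Assumption: duplicates are detected by exact string match.
    """
    seen = set(queue)
    dropped = []
    for line in raw_text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines in the pasted list
        if entry in seen:
            dropped.append(entry)
        else:
            seen.add(entry)
            queue.append(entry)
    return dropped


queue = []
pasted = "https://example.com/a\nclimate policy\nhttps://example.com/a\n"
dropped = add_to_queue(queue, pasted)
print(queue)    # entries kept, in the order they were entered
print(dropped)  # duplicates that were removed
```

In this example the pasted list contains the same URL twice, so only the first copy is queued and the second is reported as a duplicate.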
If you want to:
- delete a URL or search term from the Harvest Queue, click the trash can icon beside it
- clear everything from the Harvest Queue, click ‘Clear all URLs’
Harvested content is saved in documents (one per web page) in the ETA collection you selected. If you also want to harvest full pages (that is, content and boilerplate elements), click Batch Parameters then tick ‘Harvest full page (content and boilerplate elements)’. Each full page will be saved in a separate document in the same collection.
For more information see .
To mimic patterns of human interaction with websites, ETA Harvester can go to the websites in the Harvest Queue then wait a random amount of time (up to 60 seconds) before it begins harvesting text. To enable this setting, click Batch Parameters then tick ‘Wait random time before harvesting’.
Note: There are two ‘wait’ settings related to rule sets: ‘random wait’ (described above), which can be applied to all URLs in the Harvest Queue, and ‘rule set wait’, where you can configure individual rule sets to wait a specified amount of time (up to 60 seconds) before harvesting to enable pages to load completely (see ‘Wait Before Harvest’ in ).
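The difference between the two wait settings can be illustrated with a short Python sketch. It is an illustration only, not ETA Harvester's implementation: the function names are hypothetical, a uniform distribution is assumed for the random wait, and how ETA combines the two waits when both apply is not stated here, so the sketch keeps them separate.

```python
import random

MAX_WAIT_SECONDS = 60  # upper limit given for both wait settings


def random_wait_seconds():
    """Random wait: a different delay of up to 60 seconds for each site.

    Assumption: the delay is drawn uniformly between 0 and 60 seconds.
    """
    return random.uniform(0, MAX_WAIT_SECONDS)


def rule_set_wait_seconds(configured_seconds):
    """Rule set wait: a fixed delay configured in an individual rule set."""
    if not 0 <= configured_seconds <= MAX_WAIT_SECONDS:
        raise ValueError("wait must be between 0 and 60 seconds")
    return configured_seconds


print(round(random_wait_seconds(), 1))  # varies from run to run
print(rule_set_wait_seconds(15))        # always the configured value
```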
To repeat the harvest at a regular interval (for example, every minute or every 24 hours), click Batch Parameters then tick ‘Repeat harvest every’. Select the interval from the dropdown list. To stop repeat harvests, deselect the checkbox.
The ‘Max Harvest Depth’ setting in a rule set specifies how many levels of hyperlinks Harvester will follow from the main URL. Do one or more of the following:
- To change the maximum depth for a rule set for the current batch harvest, click in the relevant cell in the Rule Set table then enter the maximum depth. If you do not want the rule set to follow any links from the main URL, enter 1 as the maximum depth.
- To view a rule set, click its name.
- To reset the Max Depth column to default values, click Reset. This will also reset the Disable column.
Do one of the following:
- To disable a rule set for the current batch harvest, check the relevant box in the Disable column of the Rule Set table.
- To reset the Disable column to default values, click Reset. This will also reset the Max Depth column.
Note: If you disable a rule set that affects one or more URLs in the Harvest Queue, they are highlighted in red.
To clear the Harvest Queue and all settings, right-click anywhere on the page then click Reload.
To begin the batch harvest, click Batch Harvest.
Sites currently being harvested are shown in individual panes on the right. Up to four pages are harvested simultaneously.
One document is created per web page (unless you ticked ‘Harvest full page’, in which case a second document is created per web page). All of the documents are saved in the ETA collection you selected.
If the maximum depth is greater than one, any follow-on hyperlinks found on loaded pages are added to the Harvest Queue, and these pages are then harvested.
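The effect of the maximum depth can be illustrated with a short Python sketch. It is an illustration only, not ETA Harvester's implementation: the function name pages_to_harvest and the toy link table are hypothetical, the main URL is counted as level 1 (consistent with entering 1 to follow no links), and the breadth-first order in which pages are queued is an assumption.

```python
from collections import deque


def pages_to_harvest(main_url, links, max_depth):
    """Return the pages that would be queued, starting from main_url.

    links maps a page to the pages it links to (a stand-in for the
    hyperlinks found on a loaded page). The main URL is level 1, so a
    max_depth of 1 means no links are followed.
    """
    order = []
    seen = {main_url}
    pending = deque([(main_url, 1)])
    while pending:
        url, depth = pending.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for target in links.get(url, []):
            if target not in seen:  # each page is queued only once
                seen.add(target)
                pending.append((target, depth + 1))
    return order


toy_links = {"A": ["B", "C"], "B": ["D"], "C": ["A"]}
print(pages_to_harvest("A", toy_links, 1))  # main URL only
print(pages_to_harvest("A", toy_links, 2))  # main URL plus the pages it links to
```

With a maximum depth of 3 the same example would also queue page D, which is linked from page B.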
The Status column indicates the progress of each harvest. For a description of each status .
If necessary, do one or more of the following:
Add more URLs and/or search terms to the Harvest Queue
To add more URLs and/or search terms to the Harvest Queue:
- Click Pause Harvest.
- Enter or paste the URLs/search terms into the ‘URLs or Search Terms’ field.
- Click Add to Harvest Queue.
Note: If there are any duplicate URLs within the list you have just added, or any of these URLs are duplicates of URLs currently in the Harvest Queue, the duplicates are automatically deleted and a message is displayed above the Harvest Queue to indicate this.
- Click Resume Harvest.
Cancel the harvest
To cancel the harvest:
- Click Pause Harvest, right-click anywhere on the page, then click Reload.
When the batch harvest has been completed you can:
- View the harvested text as an annotated ETA document by clicking on the name of the document.
- View the full page (if you selected the ‘Harvest full page’ option) by clicking on the Show full page icon.
- Use ETA to process the harvested text as required.