Batch harvesting

Batch harvesting enables you to automatically harvest text from multiple websites and send it to an ETA collection in one operation. It is useful for harvesting text from news and social media sites, and sites related to search terms of interest to you.

Effective rule sets are critical to the success of batch harvesting. New ETA projects automatically contain a number of pre-defined rule sets for harvesting text from news sites, wikis, forums and Google searches, and from specific domains such as Twitter, Facebook and LinkedIn. Each rule set is designed to maximise the content that is harvested and minimise the boilerplate elements. In ETA Harvester, content is the text you want to harvest from a web page, such as headings, authors, dates, captions and paragraphs (as opposed to the text you want to ignore from menus, sidebars and other boilerplate elements). Boilerplate elements are the elements on websites other than the content, such as navigation bars, side bars, footers, menus and advertisements. The rule sets are in a configuration titled ‘Harvester Rule Sets’. You can customise these rule sets, delete them and/or create your own. For more information see Rule sets.
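The distinction between content and boilerplate can be illustrated with a minimal Python sketch. It is not how ETA Harvester rule sets are implemented. The `BOILERPLATE_TAGS` set, the `ContentExtractor` class and the `extract_content` function are hypothetical names chosen for this illustration, and a real rule set is likely to be considerably more selective.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this sketch (an assumption, not ETA's rules).
BOILERPLATE_TAGS = {"nav", "aside", "footer", "script", "style"}


class ContentExtractor(HTMLParser):
    """Collects text that lies outside any boilerplate element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of boilerplate elements currently open
        self.chunks = []  # pieces of content text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when we are not inside a boilerplate element.
        if text and self.depth == 0:
            self.chunks.append(text)


def extract_content(html):
    """Return the content text of a page, one chunk per line."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

For example, given a page with a navigation bar, a heading, a paragraph and a footer, `extract_content` keeps only the heading and paragraph text.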

Prior to a batch harvest, you can change two settings without modifying the rule sets themselves.

You can also enable Harvester to mimic patterns of human interaction with websites by waiting a random amount of time (up to 60 seconds) before harvesting begins.
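The random wait described above can be sketched in Python as follows. This is an illustration of the idea only, not ETA Harvester's actual implementation. The names `random_start_delay`, `harvest_with_delay` and `harvest_page` are hypothetical, and the only value taken from this page is the 60-second upper limit.

```python
import random
import time

MAX_DELAY_SECONDS = 60  # upper bound stated in the text above


def random_start_delay(max_seconds=MAX_DELAY_SECONDS, rng=random):
    """Pick a random wait time between 0 and max_seconds."""
    return rng.uniform(0, max_seconds)


def harvest_with_delay(urls, harvest_page, sleep=time.sleep, rng=random):
    """Wait a random amount of time, then harvest each URL in turn.

    harvest_page is a hypothetical callable that takes a URL and returns
    the harvested text; sleep and rng are injectable so the behaviour can
    be checked without actually waiting.
    """
    delay = random_start_delay(rng=rng)
    sleep(delay)  # a single pause before harvesting begins
    return [harvest_page(url) for url in urls]
```

Passing a stand-in for `sleep` (for example, a list's `append` method) makes it possible to confirm that one delay of at most 60 seconds is requested before any page is harvested.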

Harvested content is saved in documents, one per web page, in the ETA collection of your choice. A collection is a container for storing and organising ingested files and documents. Only the textual content is stored in collections, not the original files and documents. If you want to create a rule set and gold standard, you also need to harvest full pages (that is, content and boilerplate elements). The ‘Harvest full page’ setting on the Harvester tab enables you to do this. Each full page is saved in a separate document in the same collection.

To view the batch harvesting workflow click here.

For detailed steps see Running a batch harvest.

Figure 1: Using the Batch Harvest feature

