ETA Harvester Guide

ETA Harvester is a configurable Google Chrome/Chromium extension that extracts text from web pages and sends it to an ETA collection. A collection is a container for storing and organising ingested files and documents. Only the textual content is stored in collections, not the original files and documents.

It offers two ways to harvest text: batch harvesting and page harvesting.

Batch harvesting enables you to automatically harvest text from multiple websites and send it to an ETA collection in one operation. It is useful for harvesting text from news and social media sites, and from sites related to search terms of interest to you. See Batch harvesting.

Page harvesting enables you to manually select the elements you want to harvest on a web page and then send the selection to an ETA collection. It is particularly useful when you are conducting an investigation by browsing the internet, as you can harvest as much or as little as you like: for example, the title and abstract of an article, a social media profile, or a few paragraphs. See Page harvesting.

New ETA projects automatically contain a number of pre-defined rule sets for harvesting text from news sites, wikis, forums and Google searches, and from specific domains such as Twitter, Facebook and LinkedIn. Each rule set is designed to maximise the content that is harvested and minimise the boilerplate elements. In ETA Harvester, content is the text you want to harvest from a web page, such as headings, authors, dates, captions and paragraphs (as opposed to the text you want to ignore from menus, sidebars and other boilerplate elements). Boilerplate elements are the elements on websites other than the content, such as navigation bars, side bars, footers, menus and advertisements. The rule sets are in a configuration titled ‘Harvester Rule Sets’. You can customise these rule sets, delete them and/or create your own. See Rule sets.
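
The distinction between content and boilerplate can be illustrated with a minimal Python sketch. It is not ETA Harvester's rule set format or implementation. The RULE_SET dictionary, the tag names in it, and the harvest function are all hypothetical, chosen only to show the idea of keeping text from content elements while ignoring anything nested inside boilerplate elements.

```python
from html.parser import HTMLParser

# Hypothetical rule set: these tag names are illustrative assumptions,
# not the format ETA Harvester actually uses.
RULE_SET = {
    "content": {"h1", "h2", "p", "figcaption"},
    "boilerplate": {"nav", "aside", "footer", "script", "style"},
}


class ContentExtractor(HTMLParser):
    """Collects text found inside content tags and outside boilerplate tags."""

    def __init__(self, rule_set):
        super().__init__()
        self.rule_set = rule_set
        self.skip_depth = 0      # greater than 0 while inside a boilerplate element
        self.content_depth = 0   # greater than 0 while inside a content element
        self.harvested = []

    def handle_starttag(self, tag, attrs):
        if tag in self.rule_set["boilerplate"]:
            self.skip_depth += 1
        elif tag in self.rule_set["content"]:
            self.content_depth += 1

    def handle_endtag(self, tag):
        if tag in self.rule_set["boilerplate"] and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.rule_set["content"] and self.content_depth:
            self.content_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when it sits in a content element
        # that is not nested inside a boilerplate element.
        if text and self.content_depth and not self.skip_depth:
            self.harvested.append(text)


def harvest(html, rule_set=RULE_SET):
    """Return the list of text fragments selected by the (hypothetical) rule set."""
    parser = ContentExtractor(rule_set)
    parser.feed(html)
    parser.close()
    return parser.harvested
```

For a page such as `<nav><p>Home</p></nav><h1>Title</h1><p>Body text.</p>`, this sketch keeps the heading and the paragraph but drops the menu entry, because the menu paragraph is nested inside a boilerplate element.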

Note: Harvester can be used to extract text from .onion sites using Tor. For more information see Harvesting content from the dark web.

To install ETA Harvester see Installing ETA Harvester.

About this guide

This guide is for ETA end users and system administrators. It describes how to install and use ETA Harvester, and how to create rule sets.

Many of the features of ETA, and access to them, are configurable. For this reason there may be small variations between the screen images in this guide and your installation of ETA, and you may not be able to access all the features described.

Most of the screen captures in this guide appear as thumbnails. To expand an image, click on it. To collapse it, click on it again.

Green text indicates a glossary term. Hover over the term to display the definition.

 
