Fields and options on the Rule Set Configuration tab

Field or option

Description

Name

The name of the rule set.

To edit the name, click the Rename icon beside the name in the ‘Harvester Rule Sets’ pane. Enter the new name, then click Rename.

Note: When you create a rule set, a new collection is automatically created with the same name and the suffix 'GS'. (A collection is a container for storing and organising ingested files and documents. Only the textual content is stored in collections, not the original files and documents.) The suffix indicates that this is a gold standard collection associated with the rule set. If you rename the rule set and want to maintain the association, consider renaming the collection as well.

Description

Enter a description of the rule set.

Batch Harvest Parameters

URL Patterns

Enter the domain or URL patterns you want the rule set to harvest.

To enter the URLs of the gold standard documents in the collection, click Infer from documents.

Note: If the rule set is for a specific domain, enter the domain, using wildcards if necessary. For example: *nytimes.com*

If the rule set is more generic, enter specific domains and/or parts of URLs, using wildcards to make them as generic as possible. For example:

*.*blog*.com*

*.*news*.com*

*.*post*.com*

*.*press*.com*

*.abc.*

*/article*

*/news*

*/story*
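As a rough illustration of how wildcard patterns like those above select URLs, the following Python sketch uses the standard library's fnmatchcase function. It assumes that * matches any run of characters and that a pattern must match the whole URL; Harvester's actual matching rules are not described here and may differ. The function name matching_patterns and the sample URLs are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns taken from the examples above.
URL_PATTERNS = [
    "*nytimes.com*",
    "*.*blog*.com*",
    "*/article*",
    "*/news*",
    "*/story*",
]


def matching_patterns(url, patterns=URL_PATTERNS):
    """Return the patterns that match the whole URL (assumed semantics)."""
    return [p for p in patterns if fnmatchcase(url, p)]


if __name__ == "__main__":
    # Hypothetical sample URLs, used only to show which patterns apply.
    for url in (
        "https://www.nytimes.com/world/story.html",
        "https://www.myblog.com/post/1",
        "https://www.example.org/about",
    ):
        print(url, "->", matching_patterns(url))
```

Note that fnmatchcase also gives special meaning to ? and [ ] inside a pattern, which may not correspond to anything in Harvester; the sketch is only meant to show how broad or narrow a * pattern is.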

Pre-append Text

You can add a label or title to the top of each batch-harvested document. For example, if you want to record the name of the rule set used to harvest the content of the document, enter the name of the rule set in this field.

Rule Set Priority

You can assign priorities to rule sets. During batch harvesting, if more than one rule set could be used to harvest a specific URL or search term, the rule set with the highest priority is used to select content from the site.

For example, you may have a rule set for a specific news site such as The New York Times and a more generic rule set for other news sites. Giving The New York Times rule set a higher priority ensures that it, and not the generic rule set for news sites, is applied to pages on The New York Times website.

Harvester contains a rule set named ‘Last Resort’ with a priority of 0 (the lowest). This is a very generic rule set and is only used when no other rule set matches a given web page. However, because it is so generic, its results may not be as good as those of rule sets created for specific domains or URLs.

Enter the priority for this rule set.
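The selection described above can be sketched in Python as follows. This is a minimal illustration, not Harvester's implementation: the RuleSet class, the pick_rule_set function, the priority values 5 and 10, and the assumption that ‘Last Resort’ matches every URL through the pattern * are all hypothetical. How Harvester breaks ties between rule sets of equal priority is not stated here; this sketch simply keeps the first one.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class RuleSet:
    # Hypothetical stand-in for a rule set's name, priority and URL patterns.
    name: str
    priority: int
    url_patterns: list


def pick_rule_set(url, rule_sets):
    """Return the matching rule set with the highest priority, or None."""
    candidates = [
        rs for rs in rule_sets
        if any(fnmatchcase(url, pattern) for pattern in rs.url_patterns)
    ]
    # A larger number means a higher priority; 0 is the lowest.
    return max(candidates, key=lambda rs: rs.priority, default=None)


RULE_SETS = [
    RuleSet("Last Resort", 0, ["*"]),  # assumed to match any URL
    RuleSet("Generic news", 5, ["*/news*", "*/story*"]),
    RuleSet("The New York Times", 10, ["*nytimes.com*"]),
]

if __name__ == "__main__":
    for url in (
        "https://www.nytimes.com/world/story.html",
        "https://www.example.com/news/1",
        "https://www.example.org/about",
    ):
        print(url, "->", pick_rule_set(url, RULE_SETS).name)
```

In this sketch the New York Times URL also matches the generic news patterns, but the higher priority decides which rule set is applied.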

Max Harvest Depth

Enter the number of hyperlink levels you want Harvester to follow from the main URL. If you do not want Harvester to follow any hyperlinks, enter 1.

Wait Before Harvest

If you want Harvester to wait for a specific period before harvesting, to allow pages to load completely, select the time period. During a batch harvest this waiting period is indicated by the status ‘Waiting (rule set)’.

Note: This setting is not the same as the ‘Wait random time before harvesting’ setting in batch harvesting, which enables Harvester to mimic patterns of human interaction with websites by waiting a random amount of time (up to 60 seconds) before harvesting begins. The ‘Wait random time before harvesting’ setting, when selected, applies to all the URLs in the Harvest Queue. During a batch harvest this waiting period is indicated by the status ‘Waiting (random)’.
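To make the difference between the two waits concrete, here is a small Python sketch. It is an illustration under stated assumptions only: the function name planned_waits, the use of a uniform random distribution, and the order in which the two waits are applied are hypothetical and are not taken from Harvester. Only the status labels and the 60-second upper limit come from the description above.

```python
import random


def planned_waits(rule_set_wait_seconds, wait_random, rng=random):
    """Return (status, seconds) pairs for the waits that would apply to one URL."""
    waits = []
    if wait_random:
        # Batch-level setting: a random delay of up to 60 seconds.
        # The uniform distribution is an assumption.
        waits.append(("Waiting (random)", rng.uniform(0, 60)))
    if rule_set_wait_seconds > 0:
        # Rule-set-level setting: a fixed delay so the page can finish loading.
        waits.append(("Waiting (rule set)", float(rule_set_wait_seconds)))
    return waits


if __name__ == "__main__":
    for status, seconds in planned_waits(5, wait_random=True):
        print(f"{status}: {seconds:.1f} s")
```

The fixed wait belongs to the rule set, so it applies only to URLs harvested with that rule set, while the random wait is switched on once for the whole Harvest Queue.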

Harvest Links Only

If you do not want the rule set to harvest content from the main URL, but instead to follow hyperlinks and harvest the content from those links, tick this checkbox.

Note: If you select this option, make sure you enter the maximum number of hyperlink levels you want Harvester to follow in the Max Harvest Depth field.
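The interaction between Harvest Links Only and Max Harvest Depth can be sketched in Python as follows. The sketch assumes, based on the field descriptions, that a depth of 1 covers only the main URL and a depth of 2 adds the pages it links to. The function name pages_to_harvest and the in-memory link map are hypothetical; a real harvest would discover links by fetching pages.

```python
def pages_to_harvest(start_url, links, max_depth, harvest_links_only=False):
    """List the pages that would be harvested, level by level.

    links maps each URL to the URLs it links to (a stand-in for real pages).
    """
    levels = [[start_url]]  # level 1 is the main URL itself
    seen = {start_url}
    while len(levels) < max_depth:
        next_level = []
        for url in levels[-1]:
            for target in links.get(url, []):
                if target not in seen:  # do not revisit a page
                    seen.add(target)
                    next_level.append(target)
        if not next_level:
            break
        levels.append(next_level)
    if harvest_links_only:
        levels = levels[1:]  # skip the main URL, keep only linked pages
    return [url for level in levels for url in level]


if __name__ == "__main__":
    links = {"main": ["p1", "p2"], "p1": ["p3"]}
    print(pages_to_harvest("main", links, 1, harvest_links_only=True))
    print(pages_to_harvest("main", links, 2, harvest_links_only=True))
    print(pages_to_harvest("main", links, 2))
```

Under these assumptions, selecting Harvest Links Only with a depth of 1 leaves nothing to harvest, which is why the depth needs to be set to at least 2 when the option is selected.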

 
