Fields and options on the Rule dialog

Field or option

Description

Name

Rules are named automatically, based on the element they select (that is, the tag, not the class). Modify the name if required. Rule names do not need to be unique.

POS/NEG

Rules select elements. If the rule is positive, the element will be harvested. If the rule is negative, the element will be excluded from the harvest.

Negative rules override positive rules. For example, if a rule selects the text in a table of contents but another rule excludes the entire table of contents, the text in the table of contents will not be harvested.

By default, rules are positive. To change a rule to negative, click the POS toggle beside the Name field.
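The override behaviour can be sketched in a few lines of Python. This is an illustration only, not ETA Harvester code: the `harvested` function, the tuple layout of the elements and the two example rules are all hypothetical.

```python
def harvested(elements, positive_rules, negative_rules):
    """Keep elements chosen by any positive rule unless a negative rule also selects them."""
    selected = [e for e in elements if any(rule(e) for rule in positive_rules)]
    return [e for e in selected if not any(rule(e) for rule in negative_rules)]


# Hypothetical elements as (path, text) pairs.
elements = [
    ("DIV.toc > P", "Chapter 1"),
    ("DIV.body > P", "Once upon a time"),
]


def selects_paragraphs(element):
    # Positive rule (assumed): select every P element.
    return element[0].endswith("> P")


def excludes_toc(element):
    # Negative rule (assumed): exclude everything inside the table of contents.
    return element[0].startswith("DIV.toc")


result = harvested(elements, [selects_paragraphs], [excludes_toc])
print(result)
```

Both paragraphs are selected by the positive rule, but only the body paragraph survives because the negative rule removes the table of contents.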

Embed rule name into extracted documents

If you want to create a native annotation (that is, a custom tag) in ETA for text harvested by this rule, tick this box.

To process the output from the rule you will need to create an entity extraction script. For information see Entity Extraction Scripts configuration in the ETA Configurations Guide.

Tag as

If you want to apply an ETA tag to text harvested by the rule, select the tag from the dropdown list. To create a new tag, select ‘Custom’ then enter the name of the tag.

For example, you have created a rule to harvest the ‘Poster’ element on a forum, which will gather the identifiers of people who have posted comments. Identifiers on the forum are typically one word, such as ‘BlueFin’ or ‘Fullarton 125’, so ETA won’t recognise them as names. By creating a custom tag called ‘Poster’ you will be able to identify this information in ETA later.

Notes

Enter any notes relevant to the rule.

Rule Path

 

(Edit classes)

To show the classes associated with the tags, expand the path by clicking the arrow on the left, then expand individual tags by clicking the arrow beside them.

To delete a tag click the trash can icon beside it.

To find more classes for a tag in the current path in the current document, click the magnifying glass icon. From the dropdown list select the class you want to add, then click the plus button beside the list. You can add as many classes as required.

Note: The classes in the dropdown list are from the tag you used to create the rule, and only classes from the current document are listed.

Classes are ‘ANDed’ together and can be positive, negative or neutral.

To change the state of a class, click the box beside it until the state you want is displayed.

When a rule is created, all the classes are positive by default. For a tag in the rule path to be matched, all of its classes marked as positive must be in the class list of the corresponding tag on the web page.

To prevent the rule from being executed when one or more specific classes are detected, make those classes negative.

If you want the rule to ignore one or more classes, make those classes neutral.
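As a rough sketch of how the three class states might combine, the following Python function treats positive classes as required, negative classes as forbidden and neutral classes as ignored. The function name, the state strings and the example class names are assumptions made for illustration; they are not taken from ETA Harvester.

```python
def classes_match(rule_classes, element_classes):
    """Return True if the element's classes satisfy every class state in the rule.

    rule_classes maps a class name to "positive", "negative" or "neutral".
    """
    present = set(element_classes)
    for name, state in rule_classes.items():
        if state == "positive" and name not in present:
            return False  # a required class is missing
        if state == "negative" and name in present:
            return False  # a forbidden class was detected
        # neutral classes are ignored either way
    return True


# Hypothetical rule: needs 'post', rejects 'ad', does not care about 'wide'.
rule = {"post": "positive", "ad": "negative", "wide": "neutral"}
print(classes_match(rule, ["post", "wide"]))
print(classes_match(rule, ["post", "ad"]))
```

Because the states are ANDed together, a single missing positive class or a single detected negative class is enough to stop the tag from matching.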

Allow tags in-between

If you want the rule to be executed when the tags shown below this checkbox (tags only, not classes) are matched, regardless of other tags that may be between them, tick the checkbox.

For example, if the rule path is DIV > SECTION > P:

  • a document path of ARTICLE > DIV > SECTION > P will be matched whether the checkbox is ticked or not
  • a document path of DIV > ARTICLE > SECTION > P will only be matched if the checkbox is ticked
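The two cases above can be reproduced with a small Python sketch. It assumes that, with the checkbox unticked, the rule path must match the end of the document path exactly, and that with the checkbox ticked the rule tags must appear in order with any tags allowed between them. The function names are hypothetical and classes are left out.

```python
def is_subsequence(needle, haystack):
    """True if the items of needle appear in haystack in the same order."""
    remaining = iter(haystack)
    return all(tag in remaining for tag in needle)


def path_matches(rule_path, doc_path, allow_in_between=False):
    """Compare a rule path with a document path (tags only)."""
    if not rule_path or len(rule_path) > len(doc_path):
        return False
    if not allow_in_between:
        # Assumed strict mode: the rule tags must be the last tags, with no gaps.
        return doc_path[-len(rule_path):] == rule_path
    # Assumed relaxed mode: same final tag, earlier tags in order with gaps allowed.
    if rule_path[-1] != doc_path[-1]:
        return False
    return is_subsequence(rule_path[:-1], doc_path[:-1])


rule = ["DIV", "SECTION", "P"]
print(path_matches(rule, ["ARTICLE", "DIV", "SECTION", "P"]))
print(path_matches(rule, ["DIV", "ARTICLE", "SECTION", "P"]))
print(path_matches(rule, ["DIV", "ARTICLE", "SECTION", "P"], allow_in_between=True))
```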

Auto Simplify

By default, the rule will only be executed when the exact tags and classes in the path are detected.

To automatically remove extraneous tags and classes from the path, to the point where the effect of the rule on the gold standard is not changed, click this button.

Simplified rules are more generic and run faster than rules with more complex paths.

Note: You can automatically simplify every rule in the set by clicking the ‘Auto Simplify All Rules’ button on the Rules tab.

Advanced

 

Keywords

If you want a rule to be executed only when text matches one or more keywords, tick the Keywords box then enter the keyword or keywords, one per line, in the field.

To match text within a word, use a wildcard character either side.

An asterisk (*) represents multiple characters.

A question mark (?) represents a single character.
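To get a feel for the wildcards, the sketch below uses Python's standard `fnmatch` module, which also treats `*` and `?` as wildcards. It assumes a keyword without wildcards must equal a whole word and that matching is case sensitive; neither assumption is confirmed for ETA Harvester. Note also that `fnmatch` allows `*` to match zero characters and gives `[` and `]` a special meaning.

```python
from fnmatch import fnmatchcase


def keyword_hits(text, keywords):
    """True if any whitespace-separated word in text matches any keyword pattern."""
    words = text.split()
    return any(fnmatchcase(word, keyword) for word in words for keyword in keywords)


print(keyword_hits("Posted by Jane", ["by"]))    # 'by' is a whole word
print(keyword_hits("Standby mode", ["by"]))      # 'by' only occurs inside a word
print(keyword_hits("Standby mode", ["*by*"]))    # wildcards either side
print(keyword_hits("A cat sat", ["c?t"]))        # '?' stands for one character
```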

Text length

If you want this rule to be executed only when the text length is within a specific range, tick the Text Length box then enter the range, in characters.

For example, you have created a rule set to harvest articles from random news sites. You want a rule to harvest authors’ names so you enter a keyword of ‘By’ and a text length range of 4 to 300.
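The author example can be pictured as two checks applied together. In this Python sketch the function name and defaults are hypothetical, the keyword is matched as a whole word, and the length range is assumed to include both end values.

```python
def byline_rule_applies(text, keyword="By", min_len=4, max_len=300):
    """True if text contains the keyword as a word and its length is within the range."""
    has_keyword = keyword in text.split()
    in_range = min_len <= len(text) <= max_len  # inclusive bounds are an assumption
    return has_keyword and in_range


print(byline_rule_applies("By Jane Smith"))
print(byline_rule_applies("By"))
print(byline_rule_applies("By " + "x" * 400))
```

The length range is what stops the rule from firing on a whole article paragraph that happens to contain the word ‘By’.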

Conditional selection

If you want this rule to be executed only when:

  • the previous tag has also been selected, tick ‘Conditional selection’ then select ‘Previous’ from the dropdown list
  • a specific H tag above has also been selected, tick ‘Conditional selection’ then select the H tag from the dropdown list
  • any H, P or UL tag above has also been selected, tick ‘Conditional selection’ then select ‘H*’, ‘P’ or ‘UL’ from the dropdown list

For example, if the rule selects P tags but you only want it to be executed when the previous tag (whatever it may be) has also been selected by this rule or another, tick ‘Conditional selection’ then select ‘Previous’. If you only want the rule to be executed when any H tag above has also been selected, select ‘H*’ from the dropdown list.
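A minimal Python sketch of these conditions follows. It assumes that ‘Previous’ means the element immediately before the current one in document order, and that ‘above’ means any earlier element in the document; the function names and data layout are hypothetical.

```python
import re


def tag_matches(tag, condition):
    """'H*' stands for any heading tag; any other condition is an exact tag name."""
    if condition == "H*":
        return re.fullmatch(r"H[1-6]", tag) is not None
    return tag == condition


def condition_met(index, tags, selected, condition):
    """Check a conditional selection for the element at position index.

    tags lists the tag names in document order; selected holds the positions
    of elements that have already been selected by this rule or another.
    """
    if condition == "Previous":
        return index > 0 and (index - 1) in selected
    return any(tag_matches(tags[i], condition) for i in selected if i < index)


tags = ["H1", "P", "DIV", "P"]
selected = {0}  # only the H1 has been selected so far
print(condition_met(1, tags, selected, "Previous"))
print(condition_met(3, tags, selected, "Previous"))
print(condition_met(3, tags, selected, "H*"))
```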

Pre-click before other rules

In batch harvesting, if you want ETA Harvester to expand hidden content by simulating mouse clicks before harvesting, tick this box.

For example, there is a ‘Show more’ button on a web page. In the rule that selects this button, the ‘Pre-click before other rules’ option has been selected. The button is automatically clicked before harvesting begins so that the additional content is shown and can be harvested.

Note: The ‘Selected by Rule Set’ tab lists the rules in which the pre-click parameter has been selected, and shows the effect of these rules.

The order in which rules are applied to a URL in batch harvesting is:

  1. If the setting ‘Wait random time before harvesting’ has been ticked, ETA Harvester waits a random time.
  2. The page is loaded.
  3. If the ‘Pre-click before other rules’ option has been ticked in any rules, the buttons to which these rules apply (for example ‘Show more’) are clicked and Harvester waits for the duration of the ‘rule set wait’ period to allow this content to be loaded.

     Note: There are two ‘wait’ settings related to rule sets:

    • ‘rule set wait’, where you can configure individual rule sets to wait a specified amount of time (up to 60 seconds) before harvesting to enable pages to load completely
    • ‘random wait’, where ETA Harvester goes to the websites in the Harvest Queue then waits a random amount of time (up to 60 seconds) before it begins harvesting text to mimic patterns of human interaction with websites
  4. Positive rules are applied.
  5. Negative rules are applied (overriding positive rules where there is contention).
  6. Rules that require a specific previous element to be selected (H1, H*, P etc.) are applied.
  7. Rules that require any previous element to be selected are applied.
  8. Links to be pushed to the batch queue are selected, if the current depth is less than or equal to the harvest depth.
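The sequence above can be summarised as a simple Python sketch that returns the stages that would run for one URL. It performs no real waiting, loading or clicking; the function name, parameter names and stage labels are hypothetical.

```python
def harvest_stages(random_wait, has_preclick_rules, current_depth, harvest_depth):
    """List, in order, the stages applied to one URL in batch harvesting."""
    stages = []
    if random_wait:
        stages.append("random wait")
    stages.append("load page")
    if has_preclick_rules:
        stages.append("pre-click")
        stages.append("rule set wait")
    stages += [
        "positive rules",
        "negative rules",
        "rules conditional on a specific element",
        "rules conditional on any previous element",
    ]
    # Links are only queued while the crawl has not gone past the harvest depth.
    if current_depth <= harvest_depth:
        stages.append("select links for batch queue")
    return stages


for stage in harvest_stages(True, True, current_depth=1, harvest_depth=2):
    print(stage)
```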

Harvest hyperlinks

In batch harvesting, if you want to harvest links found in elements selected by this rule (so ETA Harvester can visit these sites and harvest text from them), tick this box.

Add href attribute from A tag

This option is only available when you click on a rule where the last tag in the rule path is an ‘A’ tag. It enables you to harvest URLs from href attributes (for example, to extract hidden resource IDs from URLs).

Note: You may need to write an entity extraction script to achieve the result you require.
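To illustrate the kind of post-processing involved, here is a minimal Python sketch that collects href values from A tags and pulls an ID out of the query string. It is not an ETA entity extraction script: the sample HTML, the forum.example URL and the `id` parameter name are invented for the example, and only Python standard library calls are used.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class HrefCollector(HTMLParser):
    """Collect the href attribute of every A tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":  # HTMLParser reports tag names in lower case
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


# Hypothetical page fragment with a resource ID hidden in the link.
html = '<p><a href="https://forum.example/thread?id=4711">Thread</a></p>'

collector = HrefCollector()
collector.feed(html)

# Take the first 'id' query parameter from each URL, or None if it is absent.
ids = [parse_qs(urlparse(href).query).get("id", [None])[0] for href in collector.hrefs]
print(collector.hrefs)
print(ids)
```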

 
