Filtering

Filtering is a set of rules that two network nodes must satisfy before they are considered for similarity. If the filter is passed, the Scoring section decides whether the nodes are equal. Otherwise, scoring is not performed and the nodes are considered unequal.

If the rules are too generic, too many pairs of nodes pass the filter and the discrimination process (such as clustering) can be very slow.

The filtering configuration defines a set of rules. The filter is passed if any one rule produces the same result for both network nodes; the nodes do not need to agree on every rule.
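The "any rule" semantics can be sketched in a few lines of Python. This is only an illustration, not Sintelix code: the node model (a dictionary of channel values), the "email" channel, and the names direct_value and passes_filter are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Node = Dict[str, str]                   # hypothetical node: channel name -> value
Rule = Callable[[Node], Optional[str]]  # a rule maps a node to a result (or None)


def direct_value(channel: str) -> Rule:
    """Rule whose result is the raw value of one channel."""
    return lambda node: node.get(channel)


def passes_filter(a: Node, b: Node, rules: List[Rule]) -> bool:
    """True if any single rule gives both nodes the same (non-missing) result."""
    for rule in rules:
        result_a, result_b = rule(a), rule(b)
        if result_a is not None and result_a == result_b:
            return True
    return False


rules = [direct_value("text"), direct_value("email")]
a = {"text": "John Smith", "email": "js@example.com"}
b = {"text": "J. Smith", "email": "js@example.com"}
c = {"text": "Mary Jones", "email": "mj@example.com"}

print(passes_filter(a, b, rules))  # True: the "email" rule agrees
print(passes_filter(a, c, rules))  # False: no rule agrees
```

Only pairs for which passes_filter returns True would go on to scoring.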

Possible rule types

Direct Value rule

The Direct Value rule takes all values of a channel (as defined in the Data Transformation section) either as they are or after a "simplification" process that makes different spellings equal.

The example rule below says that if two network nodes share the same "text" channel value, they might be equal. The Scorer then decides whether they are.

<rule class="com.sintelix.core.discrimination.clustering.filtering.DirectValueRule">
<channelName>text</channelName>
<simplified>false</simplified>
</rule>
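The effect of the simplified flag can be sketched as follows. The actual simplification process is not described here, so the normalisation below (strip accents, lower-case, drop punctuation, collapse whitespace) is purely an assumption, and the function names are hypothetical.

```python
import re
import unicodedata


def simplify(value: str) -> str:
    """Assumed normalisation: strip accents, lower-case, drop punctuation."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^\w\s]", "", no_accents.lower())
    return " ".join(cleaned.split())  # collapse runs of whitespace


def direct_value_key(value: str, simplified: bool) -> str:
    """Result of a Direct Value style rule for one channel value."""
    return simplify(value) if simplified else value


# With simplified=false the spellings differ; with simplified=true they agree.
print(direct_value_key("J. Smith", False) == direct_value_key("j smith", False))  # False
print(direct_value_key("J. Smith", True) == direct_value_key("j smith", True))    # True
```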

Single Token rule

The Single Token rule implies that any network object whose channel value is a single token might be equal to another network object whose channel value has several tokens, one of which equals that single token.

Simplified - if true, the channel value is "simplified" first to normalise spelling variations.

The example rule below makes an entity with text "Smith" a potential match for any entity with text "John Smith".

<rule class="com.sintelix.core.discrimination.clustering.filtering.SingleTokenRule">
<channelName>text</channelName>
<simplified>false</simplified>
</rule>
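A minimal sketch of the single-token test, assuming whitespace tokenisation and following the description literally (one value is a single token, the other is multi-token and contains it). The function name is hypothetical and this is not the Sintelix implementation.

```python
def single_token_match(a: str, b: str) -> bool:
    """True if one value is a single token that occurs in the other, multi-token value."""
    tokens_a, tokens_b = a.split(), b.split()
    if len(tokens_a) == 1 and len(tokens_b) > 1:
        return tokens_a[0] in tokens_b
    if len(tokens_b) == 1 and len(tokens_a) > 1:
        return tokens_b[0] in tokens_a
    # Two single-token or two multi-token values are not handled by this sketch.
    return False


print(single_token_match("Smith", "John Smith"))       # True
print(single_token_match("John Smith", "Jane Smith"))  # False: neither is a single token
```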

All Tokens rule

The All Tokens rule says that two objects that share the same "significant" tokens might be equal. For example, entities with text "The GM Company" and "GM Corporation" might be equal because "GM" is significant while "the", "company" and "corporation" aren't.

<rule class="com.sintelix.core.discrimination.clustering.filtering.AllTokensRule">
<channelName>text</channelName>
<simplified>false</simplified>
<maxFrequency>0.01</maxFrequency>
</rule>
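One plausible reading of this rule can be sketched as below. It assumes that maxFrequency is an upper bound on how often a token may occur across all values before it stops being "significant", and that tokens are compared case-insensitively; neither assumption is confirmed here. All function names are hypothetical, and the demo uses a threshold of 0.3 only because the sample corpus is tiny.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable


def token_frequencies(values: Iterable[str]) -> Dict[str, float]:
    """Fraction of values in which each lower-cased token appears."""
    values = list(values)
    counts = Counter(tok for v in values for tok in set(v.lower().split()))
    return {tok: n / len(values) for tok, n in counts.items()}


def significant_tokens(value: str, freqs: Dict[str, float],
                       max_frequency: float) -> FrozenSet[str]:
    """Tokens rare enough (frequency <= max_frequency) to count as significant."""
    return frozenset(t for t in value.lower().split()
                     if freqs.get(t, 0.0) <= max_frequency)


def all_tokens_match(a: str, b: str, freqs: Dict[str, float],
                     max_frequency: float) -> bool:
    """True if both values have the same non-empty set of significant tokens."""
    key_a = significant_tokens(a, freqs, max_frequency)
    key_b = significant_tokens(b, freqs, max_frequency)
    return bool(key_a) and key_a == key_b


corpus = ["The GM Company", "GM Corporation", "The Acme Company", "Acme Corporation",
          "The Ford Company", "Ford Corporation", "The Shell Company", "Shell Corporation"]
freqs = token_frequencies(corpus)  # "the", "company", "corporation" -> 0.5; names -> 0.25

print(all_tokens_match("The GM Company", "GM Corporation", freqs, 0.3))    # True
print(all_tokens_match("The GM Company", "Acme Corporation", freqs, 0.3))  # False
```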

Common Prefix rule

The Common Prefix rule implies that any two network objects whose values in the given channel start with the same tokens might be equal. The number of tokens to consider is configurable.

The example rule implies that an entity with text "John Smith" might be the same entity as "John Smith Jr.".

<rule class="com.sintelix.core.discrimination.clustering.filtering.CommonPrefixRule">
<channelName>text</channelName>
<simplified>false</simplified>
<length>2</length>
</rule>
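A minimal sketch of a prefix comparison, assuming whitespace tokenisation and assuming that a value with fewer tokens than the configured length produces no result at all (this behaviour is a guess). The function names are hypothetical.

```python
from typing import Optional, Tuple


def prefix_key(value: str, length: int) -> Optional[Tuple[str, ...]]:
    """First `length` tokens of the value, or None if the value is too short."""
    tokens = value.split()
    return tuple(tokens[:length]) if len(tokens) >= length else None


def common_prefix_match(a: str, b: str, length: int) -> bool:
    """True if both values have a prefix key and the keys are equal."""
    key_a, key_b = prefix_key(a, length), prefix_key(b, length)
    return key_a is not None and key_a == key_b


print(common_prefix_match("John Smith", "John Smith Jr.", 2))  # True
print(common_prefix_match("John Smith", "John Smythe", 2))     # False
```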

Common Suffix rule

The Common Suffix rule implies that any two network objects whose values in the given channel end with the same tokens might be equal. The number of tokens to consider is configurable.

The example rule implies that an entity with text "Smith Jr." might be the same entity as "John Smith Jr.".

<rule class="com.sintelix.core.discrimination.clustering.filtering.CommonSuffixRule">
<channelName>text</channelName>
<simplified>false</simplified>
<length>2</length>
</rule>
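Because a rule of this kind reduces each value to a key, candidates can be found by grouping values on that key instead of comparing every pair. The sketch below illustrates the idea for a suffix of configurable length; it assumes whitespace tokenisation, skips values that are too short, and uses hypothetical names throughout.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def suffix_key(value: str, length: int) -> Optional[Tuple[str, ...]]:
    """Last `length` tokens of the value, or None if the value is too short."""
    if length < 1:
        raise ValueError("length must be at least 1")
    tokens = value.split()
    return tuple(tokens[-length:]) if len(tokens) >= length else None


def group_by_suffix(values: Iterable[str], length: int) -> Dict[Tuple[str, ...], List[str]]:
    """Group values by suffix key; keep only groups with at least two members."""
    buckets: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for value in values:
        key = suffix_key(value, length)
        if key is not None:
            buckets[key].append(value)
    return {key: members for key, members in buckets.items() if len(members) > 1}


groups = group_by_suffix(["Smith Jr.", "John Smith Jr.", "John Smith", "Jane Doe"], 2)
print(groups)  # {('Smith', 'Jr.'): ['Smith Jr.', 'John Smith Jr.']}
```

Only values that land in the same group would be passed on to scoring.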

First And Last Tokens rule

The First And Last Tokens rule implies that any two network objects whose values in the given channel start with a common token and end with a common token might be equal.

<rule class="com.sintelix.core.discrimination.clustering.filtering.FirstAndLastTokensRule">
<channelName>text</channelName>
<simplified>false</simplified>
</rule>

The example rule above implies that an entity with text "John Smith" might be the same entity as "John F. Smith".
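The comparison can be sketched by reducing each value to its first and last token, assuming whitespace tokenisation. How single-token values are treated is not stated; this sketch simply uses the same token for both positions. The function names are hypothetical.

```python
from typing import Optional, Tuple


def first_last_key(value: str) -> Optional[Tuple[str, str]]:
    """(first token, last token) of the value, or None for an empty value."""
    tokens = value.split()
    if not tokens:
        return None
    # For a single-token value, first and last are the same token (assumption).
    return tokens[0], tokens[-1]


def first_last_match(a: str, b: str) -> bool:
    """True if both values share the same first and the same last token."""
    key_a, key_b = first_last_key(a), first_last_key(b)
    return key_a is not None and key_a == key_b


print(first_last_match("John Smith", "John F. Smith"))  # True
print(first_last_match("John Smith", "Jane Smith"))     # False
```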

 
