Identifying similar documents in a collection

When documents are uploaded in bulk, duplicate or very similar documents tend to get included. The Document Similarity Tool allows you to quickly find these documents so that you can manage them or remove them entirely from your collection. (A collection is a container for storing and organising ingested files and documents. Only the textual content is stored in collections, not the original files and documents.)

On the Collection Page, the Document Similarity Tool can be found in the top right-hand corner of the Documents tab. A checkbox to its left allows you to enable or disable the feature.

When similar documents are identified, a number appears to the left of the document title, as shown in the image below.

Selecting the toggle arrow immediately to the left of this number will reveal information on the specific document, including information on the similar documents that were identified.

When similar documents are found, they are not physically grouped together. All the documents remain in the collection as independent documents, and the collection size does not change.
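The behaviour described above can be pictured as an annotation layered over the collection rather than a change to it. The following is a minimal sketch only: the function name annotate_similar and the (title, title) pair format are hypothetical and do not reflect how the tool is actually implemented.

```python
from collections import defaultdict


def annotate_similar(titles, similar_pairs):
    """Return a mapping of each title to the titles flagged as similar to it.

    Hypothetical illustration: `similar_pairs` is an assumed list of
    (title, title) tuples produced by some similarity check. The input
    list of titles is never modified, merged, or shortened.
    """
    related = defaultdict(list)
    for first, second in similar_pairs:
        related[first].append(second)
        related[second].append(first)
    # Every document keeps its own entry, even when nothing matched it.
    return {title: related.get(title, []) for title in titles}


titles = ["Invoice A", "Invoice A (copy)", "Meeting notes"]
annotations = annotate_similar(titles, [("Invoice A", "Invoice A (copy)")])
```

In this sketch the annotations hold one entry per document, so the three documents remain three independent documents; only the list of related titles attached to each one changes.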

Using the Tolerance Slider

The Tolerance Slider is used to adjust the match precision between two different documents. It can be adjusted to a value between 0 and 50 percent. A lower tolerance value will require higher similarity between documents for a match to be detected. For example, at 1%, two documents would need to be practically identical in content to be considered a match.

At 10%, the matching tolerance is relaxed enough that documents such as emails from different people containing mostly the same information would be matched. It is recommended that the Tolerance Slider be set no higher than 10% if you want to detect documents that are very similar.

A tolerance above 10% can be used for broader filtering, to identify documents that may be related in subject matter rather than near-identical in content.
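One way to think about the slider is as the inverse of a similarity threshold. The sketch below assumes that a tolerance of t percent means two documents match when their similarity is at least 100 - t percent, and it uses word-set overlap (Jaccard similarity) as a stand-in measure. Both are assumptions for illustration: the tool's actual similarity measure is not described here, and the names similarity and is_match are hypothetical.

```python
def word_set(text):
    """Lower-case the text and split it into a set of words."""
    return set(text.lower().split())


def similarity(doc_a, doc_b):
    """Jaccard similarity of the two word sets, between 0.0 and 1.0.

    This is a stand-in measure chosen for the example, not the tool's own.
    """
    words_a, words_b = word_set(doc_a), word_set(doc_b)
    if not words_a and not words_b:
        return 1.0  # two empty documents are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def is_match(doc_a, doc_b, tolerance_pct):
    """Assumed rule: match when similarity >= 1 - tolerance."""
    if not 0 <= tolerance_pct <= 50:
        raise ValueError("tolerance must be between 0 and 50 percent")
    return similarity(doc_a, doc_b) >= 1.0 - tolerance_pct / 100.0


email_a = "one two three four five six seven eight nine ten"
email_b = "one two three four five six seven eight nine eleven"
strict = is_match(email_a, email_b, 10)    # threshold 0.90
relaxed = is_match(email_a, email_b, 20)   # threshold 0.80
```

Under these assumptions the two example texts share 9 of 11 distinct words (similarity of about 0.82), so they are not matched at 10% but are matched at 20%. A real measure would behave differently on real documents; the point is only that raising the tolerance lowers the bar for a match.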