A Fast Text Similarity Measure for Large Document Collections using Multi-reference Cosine and Genetic Algorithm

Hamid Mohammadi; Seyed Hossein Khasteh

arXiv:1810.03102 · cs.IR · September 26, 2019

TL;DR

This paper introduces a fast, scalable, and reliable text similarity measure for large document collections using multi-reference cosine signatures optimized by genetic algorithms, improving duplicate detection efficiency.

Contribution

It presents a novel signature-based text similarity method that employs genetic algorithms to generate optimal reference texts, enhancing duplicate detection in large datasets.

Findings

01 Comparable accuracy to state-of-the-art algorithms

02 High scalability and speed in large datasets

03 Reduced storage requirements

Abstract

One of the important factors that make a search engine fast and accurate is a concise and duplicate-free index. To remove duplicate and near-duplicate documents from the index, a search engine needs a swift and reliable duplicate and near-duplicate text document detection system. Traditional approaches to this problem, such as brute-force comparisons or simple hash-based algorithms, are not suitable because they do not scale and cannot detect near-duplicate documents effectively. In this paper, a new signature-based approach to text similarity detection is introduced which is fast, scalable, and reliable and needs less storage space. The proposed method is examined on popular text document datasets such as CiteseerX, Enron, and the Gold Set of Near-duplicate News Articles. The results are promising and comparable with the best cutting-edge algorithms, considering the…
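
The signature idea can be illustrated with a minimal sketch. It assumes one plausible reading of "multi-reference cosine": each document is reduced to the vector of its cosine similarities to a small set of reference texts, and documents are then compared through these short signatures instead of their full text. The function names, the whitespace tokenizer, the distance measure, and the hand-picked `REFERENCES` are all assumptions made for illustration; in the paper the reference texts are produced by a genetic algorithm, which is not reproduced here.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def signature(text, references):
    """One cosine value per reference text: a short, fixed-length signature."""
    vec = term_vector(text)
    return [cosine(vec, term_vector(ref)) for ref in references]


def signature_distance(sig_a, sig_b):
    """Largest per-reference gap between two signatures (an assumed measure)."""
    return max(abs(x - y) for x, y in zip(sig_a, sig_b))


# Hypothetical, hand-written reference texts; the paper optimizes these
# with a genetic algorithm instead.
REFERENCES = [
    "search engine index document",
    "email message meeting schedule",
    "news article report market",
]

doc_a = "the search engine builds an index of every document"
doc_b = "the search engine builds an index of each document"   # near-duplicate of doc_a
doc_c = "please schedule a meeting and send the email message"  # unrelated

sig_a = signature(doc_a, REFERENCES)
sig_b = signature(doc_b, REFERENCES)
sig_c = signature(doc_c, REFERENCES)

near_duplicate_gap = signature_distance(sig_a, sig_b)
unrelated_gap = signature_distance(sig_a, sig_c)
```

Under these assumptions each document stores only as many numbers as there are reference texts, which is in line with the reduced storage the abstract mentions. How well the signatures separate documents depends on the choice of references, which is presumably why the paper optimizes them.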
