A Fast Text Similarity Measure for Large Document Collections using Multi-reference Cosine and Genetic Algorithm
Hamid Mohammadi, Seyed Hossein Khasteh

TL;DR
This paper introduces a fast, scalable, and reliable text similarity measure for large document collections using multi-reference cosine signatures optimized by genetic algorithms, improving duplicate detection efficiency.
Contribution
It presents a novel signature-based text similarity method that employs genetic algorithms to generate optimal reference texts, enhancing duplicate detection in large datasets.
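This summary does not spell out how a signature is built, so the following is only a minimal sketch of the general idea, under stated assumptions: a document's signature is taken to be the vector of cosine similarities between its bag-of-words vector and each of a few reference texts, and two documents are near-duplicate candidates when their signatures are close. The function names, the whitespace tokenizer, the max-difference distance, and the hand-picked `REFERENCES` are all hypothetical; in the paper the reference texts are generated by a genetic algorithm, which is not reproduced here.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace-tokenized)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def signature(text, references):
    """One cosine value per reference text: a short fixed-length signature."""
    doc = term_vector(text)
    return [cosine(doc, term_vector(ref)) for ref in references]


def signature_distance(sig_a, sig_b):
    """Largest per-reference difference (an assumed distance, for illustration)."""
    return max(abs(x - y) for x, y in zip(sig_a, sig_b))


# Hypothetical, hand-picked references; the paper optimizes these instead.
REFERENCES = [
    "search engine index document web",
    "email message meeting schedule report",
    "news article market price company",
]

doc_a = "the search engine keeps a compact index of every web document"
doc_b = "the search engine keeps a compact index of each web document"
doc_c = "please schedule a meeting to discuss the quarterly report"

sig_a = signature(doc_a, REFERENCES)
sig_b = signature(doc_b, REFERENCES)
sig_c = signature(doc_c, REFERENCES)

# The near-duplicate pair (a, b) should sit closer together than (a, c).
print(signature_distance(sig_a, sig_b), signature_distance(sig_a, sig_c))
```

Only the short signature needs to be stored and compared per document, which is where the speed and storage savings of a signature-based scheme would come from. How well signatures separate documents depends heavily on the choice of references, which is the part the paper delegates to the genetic algorithm.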
Findings
Comparable accuracy to state-of-the-art algorithms
High scalability and speed in large datasets
Reduced storage requirements
Abstract
One important factor in making a search engine fast and accurate is a concise, duplicate-free index. To remove duplicate and near-duplicate documents from the index, a search engine needs a swift and reliable system for detecting duplicate and near-duplicate text documents. Traditional approaches to this problem, such as brute-force comparisons or simple hash-based algorithms, are not suitable because they do not scale and cannot detect near-duplicate documents effectively. In this paper, a new signature-based approach to text similarity detection is introduced that is fast, scalable, and reliable and needs less storage space. The proposed method is evaluated on popular text document data sets, including CiteseerX, Enron, and the Gold Set of Near-duplicate News Articles. The results are promising and comparable with the best cutting-edge algorithms, considering the…
