Multi-reference Cosine: A New Approach to Text Similarity Measurement in Large Collections
Hamid Mohammadi, Amin Nikoukaran

TL;DR
This paper introduces Multi-reference Cosine, a batch text similarity method for large web collections that is reported to be faster than standard cosine similarity and more accurate than Simhash.
Contribution
The paper presents Multi-reference Cosine, a new text similarity approach that combines ideas from dimensionality reduction and information gain theory, aimed at detecting duplicate and near-duplicate web pages at scale.
Findings
Faster than traditional cosine similarity
More accurate than Simhash
Evaluated on the NEWS20 dataset
Abstract
An efficient and scalable document similarity detection system is essential today. Search engines need batch text similarity measures to detect duplicate and near-duplicate web pages in their indexes, so that the same page is not indexed multiple times. Furthermore, in the scoring phase, search engines need similarity measures to detect duplicated content on web pages in order to improve the quality of their results. In this paper, a new approach to batch text similarity detection is proposed that combines ideas from dimensionality reduction techniques and information gain theory. The new approach focuses on the need of search engines to detect duplicate and near-duplicate web pages. It is evaluated on the NEWS20 dataset, and the results show that the new approach is faster than the cosine text similarity algorithm in terms of speed and…
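For readers unfamiliar with the two baselines named above, the following is a minimal sketch of plain cosine similarity over term-frequency vectors and of a Simhash-style fingerprint compared by Hamming distance. It does not implement Multi-reference Cosine, whose details are not given in this summary. The whitespace tokenizer, the MD5-based token hash, the 64-bit fingerprint width, and all function names are assumptions chosen for illustration.

```python
import hashlib
import math
from collections import Counter

BITS = 64  # assumed fingerprint width for the Simhash sketch


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between raw term-frequency vectors."""
    tf_a, tf_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def simhash(text):
    """Simhash-style fingerprint: each token votes on every bit, weighted by its count."""
    votes = [0] * BITS
    for term, weight in Counter(tokenize(text)).items():
        # Assumption: the first 8 bytes of MD5 serve as a stable 64-bit token hash.
        digest = hashlib.md5(term.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            votes[i] += weight if (token_hash >> i) & 1 else -weight
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def hamming_distance(fingerprint_a, fingerprint_b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(fingerprint_a ^ fingerprint_b).count("1")


if __name__ == "__main__":
    page_a = "search engines detect duplicated web pages in their indexes"
    page_b = "search engines detect near duplicated web pages in their indexes"
    print(cosine_similarity(page_a, page_b))
    print(hamming_distance(simhash(page_a), simhash(page_b)))
```

The trade-off the summary alludes to is visible in the shape of the two functions: cosine similarity compares full term vectors for every pair of documents, whereas a Simhash fingerprint is computed once per document and pairs are then compared with a cheap bit operation, at the cost of a coarser similarity signal.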
Taxonomy
Topics: Web Data Mining and Analysis · Spam and Phishing Detection · Advanced Text Analysis Techniques
