Multi-reference Cosine: A New Approach to Text Similarity Measurement in Large Collections

Hamid Mohammadi; Amin Nikoukaran

arXiv:1810.03099 · cs.IR · October 9, 2018 · 1 cites

Open Access

TL;DR

This paper introduces Multi-reference Cosine, an efficient and scalable text similarity method for large web collections that is reported to be faster than traditional cosine similarity and more accurate than Simhash.

Contribution

The paper presents Multi-reference Cosine, a new text similarity approach combining dimensionality reduction and information gain, optimized for large-scale web page duplicate detection.

Findings

01 Faster than traditional cosine similarity

02 More accurate than Simhash

03 Effective on the NEWS20 evaluation dataset

Abstract

The importance of an efficient and scalable document similarity detection system is undeniable nowadays. Search engines need batch text similarity measures to detect duplicated and near-duplicated web pages in their indexes in order to prevent indexing a web page multiple times. Furthermore, in the scoring phase, search engines need similarity measures to detect duplicated content on web pages so as to increase the quality of their results. In this paper, a new approach to batch text similarity detection is proposed by combining ideas from dimensionality reduction techniques and information gain theory. The approach focuses on search engines' need to detect duplicated and near-duplicated web pages. It is evaluated on the NEWS20 dataset, and the results show that it is faster than the cosine text similarity algorithm in terms of speed and…
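
For readers unfamiliar with the two baselines named above, the following is a minimal sketch of plain cosine similarity over term-frequency vectors and of a Simhash fingerprint compared by Hamming distance. It is not the Multi-reference Cosine method, whose details are not given in this summary. The whitespace tokenizer, the use of MD5 as the token hash, the 64-bit fingerprint width, and all function names are assumptions made for illustration.

```python
import hashlib
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    # Assumption: lowercase whitespace tokenization, raw term counts as weights.
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


def simhash(text: str, bits: int = 64) -> int:
    """Simhash fingerprint: per-bit weighted vote over token hashes (bits <= 64)."""
    votes = [0] * bits
    for term, weight in Counter(text.lower().split()).items():
        # Assumption: the first 8 bytes of MD5 serve as a stable 64-bit token hash.
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    # A bit is set in the fingerprint when its vote total is positive.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

In this sketch, near-duplicate detection with cosine means comparing the score against a threshold close to 1, while with Simhash it means accepting pairs whose fingerprints differ in only a few bits. Cosine compares full term vectors pair by pair, whereas Simhash compares fixed-size fingerprints, which is the usual trade-off between accuracy and speed for these two baselines.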

Taxonomy

Topics: Web Data Mining and Analysis · Spam and Phishing Detection · Advanced Text Analysis Techniques