Document Similarity from Vector Space Densities

Ilia Rushkin

arXiv:2009.00672·cs.CL·September 3, 2020

TL;DR

This paper introduces the density similarity (DS) method, a computationally light, semantically aware way to estimate similarities between text documents using word embeddings and kernel regression. Its accuracy is virtually the same as that of a state-of-the-art method, at a substantially higher speed.

Contribution

The paper presents a novel, computationally efficient similarity measure for text documents that incorporates semantic relations via word embeddings and kernel regression.

Findings

01. The DS method achieves accuracy virtually the same as that of a state-of-the-art method.

02. The DS method is substantially faster than that state-of-the-art method.

03. The paper introduces generalized versions of the top-k accuracy metric and of the Jaccard metric of agreement between similarity models.
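For orientation, the ordinary (non-generalized) Jaccard agreement between two similarity models can be sketched as follows. This page does not define the paper's generalized version, so the code below is only the standard baseline under stated assumptions: each model is given as a square document-by-document similarity matrix, and the helper names `top_k_neighbors` and `mean_jaccard_agreement` are hypothetical.

```python
import numpy as np

def top_k_neighbors(sim, k):
    """Indices of the k most similar other documents for each row of sim."""
    s = np.array(sim, dtype=float)   # work on a copy
    np.fill_diagonal(s, -np.inf)     # a document is not its own neighbor
    return np.argsort(-s, axis=1)[:, :k]

def mean_jaccard_agreement(sim_a, sim_b, k):
    """Average Jaccard overlap of the top-k neighbor sets under two models."""
    scores = []
    for row_a, row_b in zip(top_k_neighbors(sim_a, k), top_k_neighbors(sim_b, k)):
        a, b = set(row_a.tolist()), set(row_b.tolist())
        scores.append(len(a & b) / len(a | b))
    return float(np.mean(scores))

# Toy similarity matrices for four documents (illustrative values only).
sim_a = np.array([[1.0, 0.9, 0.1, 0.2],
                  [0.9, 1.0, 0.3, 0.1],
                  [0.1, 0.3, 1.0, 0.8],
                  [0.2, 0.1, 0.8, 1.0]])
sim_b = np.array([[1.0, 0.2, 0.9, 0.1],
                  [0.2, 1.0, 0.1, 0.8],
                  [0.9, 0.1, 1.0, 0.3],
                  [0.1, 0.8, 0.3, 1.0]])

print(mean_jaccard_agreement(sim_a, sim_a, k=1))  # a model compared with itself
print(mean_jaccard_agreement(sim_a, sim_b, k=1))  # two models that disagree on every nearest neighbor
```

A score of 1 means the two models pick identical top-k neighbor sets for every document, and 0 means the sets never overlap.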

Abstract

We propose a computationally light method for estimating similarities between text documents, which we call the density similarity (DS) method. The method is based on a word embedding in a high-dimensional Euclidean space and on kernel regression, and takes into account semantic relations among words. We find that the accuracy of this method is virtually the same as that of a state-of-the-art method, while the gain in speed is very substantial. Additionally, we introduce generalized versions of the top-k accuracy metric and of the Jaccard metric of agreement between similarity models.
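The idea in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the Gaussian kernel, the bandwidth, the use of the word vectors themselves as sample points, the cosine comparison, and the function name `density_similarity` are all assumptions made for illustration. Each document's word weights are smoothed over the embedding space by Nadaraya-Watson kernel regression, and documents are then compared through the resulting density values.

```python
import numpy as np

def density_similarity(embeddings, doc_term, sample_pts, bandwidth=1.0):
    """Cosine similarity between kernel-smoothed document densities.

    embeddings: (V, d) word vectors; doc_term: (D, V) word weights per document;
    sample_pts: (m, d) points at which each document's density is evaluated.
    All choices here (Gaussian kernel, cosine) are assumptions, not the paper's.
    """
    # Squared Euclidean distances between sample points and word vectors: (m, V).
    d2 = ((sample_pts[:, None, :] - embeddings[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-d2 / (2.0 * bandwidth ** 2))
    # Nadaraya-Watson normalization: the weights at each sample point sum to 1.
    kernel /= kernel.sum(axis=1, keepdims=True) + 1e-12
    density = doc_term @ kernel.T                      # (D, m)
    norms = np.linalg.norm(density, axis=1, keepdims=True)
    unit = density / np.maximum(norms, 1e-12)
    return unit @ unit.T                               # (D, D)

# Toy 2-D "embedding": two nearby animal words and two nearby vehicle words.
embeddings = np.array([[0.0, 0.0],    # cat
                       [0.1, 0.0],    # dog
                       [5.0, 5.0],    # car
                       [5.1, 5.0]])   # truck
# Three one-word documents: "cat", "dog", "car".
doc_term = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])

sim = density_similarity(embeddings, doc_term, sample_pts=embeddings)
print(np.round(sim, 3))
```

In this toy case the "cat" and "dog" documents share no words, so a plain bag-of-words cosine would give 0, yet their smoothed densities nearly coincide because the two word vectors are close. The kernel matrix depends only on the vocabulary and the sample points, so it can be computed once and reused for every document, which is one plausible source of the speed the abstract reports.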
