Sublinear Time Approximation of Text Similarity Matrices

Archan Ray; Nicholas Monath; Andrew McCallum; Cameron Musco

arXiv:2112.09631 · cs.LG · April 28, 2022

TL;DR

This paper introduces a sublinear time algorithm for approximating pairwise similarity matrices in NLP, including indefinite matrices, enabling efficient computation for large datasets with high accuracy in downstream tasks.

Contribution

The paper generalizes the Nyström method to indefinite similarity matrices and runs in time sublinear in the number of entries of the similarity matrix, which makes similarity approximation in NLP more efficient.

Findings

01 High accuracy in document classification tasks

02 Effective approximation of indefinite similarity matrices

03 Sublinear time complexity achieved

Abstract

We study algorithms for approximating pairwise similarity matrices that arise in natural language processing. Generally, computing a similarity matrix for $n$ data points requires $\Omega(n^2)$ similarity computations. This quadratic scaling is a significant bottleneck, especially when similarities are computed via expensive functions, e.g., via transformer models. Approximation methods reduce this quadratic complexity, often by using a small subset of exactly computed similarities to approximate the remainder of the complete pairwise similarity matrix. Significant work focuses on the efficient approximation of positive semidefinite (PSD) similarity matrices, which arise, e.g., in kernel methods. However, much less is understood about indefinite (non-PSD) similarity matrices, which often arise in NLP. Motivated by the observation that many of these matrices are still somewhat close to…
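To make the idea of "using a small subset of exactly computed similarities" concrete, here is a minimal sketch of the classic Nyström approximation for a PSD matrix. It is not the paper's generalization to indefinite matrices; the function name `nystrom_approx`, the uniform landmark sampling, the `rcond` cutoff, and the synthetic low-rank test matrix are all assumptions chosen for illustration.

```python
import numpy as np

def nystrom_approx(similarity, n, num_landmarks, rng):
    """Classic Nystrom sketch: K is approximated by C W^+ C^T.

    `similarity(i, j)` is a hypothetical callback returning the similarity
    between points i and j. Only n * num_landmarks entries are evaluated.
    """
    # Sample landmark indices uniformly without replacement (an assumption).
    idx = rng.choice(n, size=num_landmarks, replace=False)
    # C: similarities between every point and each landmark (n x s).
    C = np.array([[similarity(i, j) for j in idx] for i in range(n)])
    # W: landmark-to-landmark block of C (s x s), symmetrized for safety.
    W = C[idx, :]
    W = (W + W.T) / 2
    # Pseudo-inverse with an explicit cutoff to drop near-zero singular values.
    return C @ np.linalg.pinv(W, rcond=1e-10) @ C.T

# Synthetic PSD matrix of rank 3, standing in for an expensive similarity.
rng = np.random.default_rng(0)
n, rank, s = 50, 3, 10
X = rng.standard_normal((n, rank))
K = X @ X.T

calls = 0
def similarity(i, j):
    """Count how many exact similarity evaluations are requested."""
    global calls
    calls += 1
    return K[i, j]

K_hat = nystrom_approx(similarity, n, s, rng)
max_error = np.max(np.abs(K - K_hat))
print(f"similarity evaluations: {calls} of {n * n}")
print(f"max abs error: {max_error:.2e}")
```

The sketch evaluates only $n \cdot s$ similarities instead of $n^2$. For an exactly low-rank PSD matrix the reconstruction is essentially exact once the landmark block captures the full rank; for indefinite matrices the landmark block can have eigenvalues near zero, which makes the plain pseudo-inverse unreliable and is the setting the paper targets.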

Code & Models

Repositories

archanray/approximate_similarities (official)

Videos

Sublinear Time Approximation of Text Similarity Matrices · Underline

Taxonomy

Topics: Topic Modeling · Multi-Criteria Decision Making · Bayesian Modeling and Causal Inference