Representing Sentences as Low-Rank Subspaces
Jiaqi Mu, Suma Bhat, Pramod Viswanath

TL;DR
This paper introduces an unsupervised method that represents a sentence by the low-rank subspace spanned by its word vectors. On semantic textual similarity tasks, it outperforms neural network models, including skip-thought vectors.
Contribution
The paper observes that the word vectors of a sentence roughly lie in a low-rank subspace, and proposes representing each sentence by that subspace. This simple geometric structure improves performance on semantic similarity tasks.
Findings
The method outperforms neural network models, including skip-thought vectors, by 15% on average on semantic textual similarity tasks.
The word vectors of a sentence approximately lie in a low-rank subspace (roughly rank 4).
The method is validated on 19 semantic textual similarity datasets.
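One way to check the low-rank finding on your own data is to measure how much of a sentence's word-vector matrix is captured by its top few singular directions. The following is a minimal sketch, not the authors' code: the function name `energy_captured`, the choice not to mean-center the vectors, and the synthetic input are all assumptions made for illustration.

```python
import numpy as np


def energy_captured(word_vectors: np.ndarray, rank: int = 4) -> float:
    """Fraction of the squared Frobenius norm of `word_vectors`
    (one row per word) captured by its top `rank` singular directions.

    Hypothetical helper; the vectors are not mean-centered (an assumption).
    """
    singular_values = np.linalg.svd(word_vectors, compute_uv=False)
    total = np.sum(singular_values ** 2)
    return float(np.sum(singular_values[:rank] ** 2) / total)


# Synthetic stand-in for a 10-word sentence with 50-dimensional embeddings:
# rows are built from 4 basis directions plus a small amount of noise.
rng = np.random.default_rng(0)
basis = rng.standard_normal((4, 50))
coefficients = rng.standard_normal((10, 4))
sentence = coefficients @ basis + 0.01 * rng.standard_normal((10, 50))

ratio = energy_captured(sentence, rank=4)
print(f"energy captured by rank 4: {ratio:.4f}")
```

With real embeddings, a ratio close to 1 at rank 4 would be consistent with the finding above. The synthetic data here is low-rank by construction, so it only demonstrates the measurement.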
Abstract
Sentences are important semantic units of natural language. A generic, distributional representation of sentences that can capture the latent semantics is beneficial to multiple downstream applications. We observe a simple geometry of sentences -- the word representations of a given sentence (on average 10.23 words in all SemEval datasets with a standard deviation of 4.84) roughly lie in a low-rank subspace (roughly, rank 4). Motivated by this observation, we represent a sentence by the low-rank subspace spanned by its word vectors. Such an unsupervised representation is empirically validated via semantic textual similarity tasks on 19 different datasets, where it outperforms the sophisticated neural network models, including skip-thought vectors, by 15% on average.
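The representation described in the abstract can be sketched with a truncated SVD: stack a sentence's word vectors as rows and keep the top right singular vectors as an orthonormal basis for its subspace. The abstract does not say how two subspaces are compared, so the similarity below (root sum of squared singular values of the product of the two bases, i.e. of the cosines of the principal angles) is an assumption, as are the function names `sentence_subspace` and `subspace_similarity` and the use of uncentered vectors.

```python
import numpy as np


def sentence_subspace(word_vectors: np.ndarray, rank: int = 4) -> np.ndarray:
    """Return a (dim, k) orthonormal basis, k = min(rank, number of words),
    for the top singular directions of the word-vector matrix (one row per word).

    Hypothetical helper; uses an uncentered SVD (an assumption).
    """
    _, _, vt = np.linalg.svd(word_vectors, full_matrices=False)
    return vt[:rank].T


def subspace_similarity(basis_a: np.ndarray, basis_b: np.ndarray) -> float:
    """Assumed similarity: sqrt of the sum of squared cosines of the
    principal angles between the two subspaces."""
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.sqrt(np.sum(cosines ** 2)))


# Random stand-ins for two sentences (10 and 12 words, 50-dim embeddings).
rng = np.random.default_rng(0)
sentence_a = rng.standard_normal((10, 50))
sentence_b = rng.standard_normal((12, 50))

basis_a = sentence_subspace(sentence_a)
basis_b = sentence_subspace(sentence_b)

self_similarity = subspace_similarity(basis_a, basis_a)   # sqrt(4) up to rounding
cross_similarity = subspace_similarity(basis_a, basis_b)  # between 0 and sqrt(4)
print(self_similarity, cross_similarity)
```

Under this assumed measure the score ranges from 0 for mutually orthogonal subspaces to the square root of the rank for identical ones, so dividing by that square root gives a value in [0, 1] if a normalized score is preferred.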
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
