Redundancy, Isotropy, and Intrinsic Dimensionality of Prompt-based Text Embeddings
Hayato Tsukagoshi, Ryohei Sasano

TL;DR
This paper investigates the redundancy and intrinsic properties of prompt-based text embeddings, showing that significant dimensionality reduction minimally impacts performance, especially for classification and clustering tasks.
Contribution
It provides a comprehensive analysis of the redundancy, isotropy, and intrinsic dimensionality of prompt-based embeddings, highlighting their high redundancy and robustness to dimensionality reduction.
Findings
Naive dimensionality reduction, keeping only the first 25% of the embedding dimensions, causes only a very slight performance degradation.
Embeddings for classification and clustering have lower intrinsic dimensionality than those for retrieval and STS.
High-dimensional embeddings exhibit high redundancy and less isotropy.
Abstract
Prompt-based text embedding models, which generate task-specific embeddings upon receiving tailored prompts, have recently demonstrated remarkable performance. However, their resulting embeddings often have thousands of dimensions, leading to high storage costs and increased computational costs of embedding-based operations. In this paper, we investigate how post-hoc dimensionality reduction applied to the embeddings affects the performance of various tasks that leverage these embeddings, specifically classification, clustering, retrieval, and semantic textual similarity (STS) tasks. Our experiments show that even a naive dimensionality reduction, which keeps only the first 25% of the dimensions of the embeddings, results in a very slight performance degradation, indicating that these embeddings are highly redundant. Notably, for classification and clustering, even when embeddings are…
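The naive reduction described in the abstract (keeping only the leading 25% of the dimensions) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `truncate_embeddings`, the re-normalization step, and the random placeholder vectors standing in for real model outputs are all assumptions made for the example.

```python
import numpy as np


def truncate_embeddings(embeddings, keep_ratio=0.25):
    """Keep only the leading fraction of dimensions of each embedding.

    Hypothetical helper: the L2 re-normalization is an assumption so that
    dot products between reduced vectors behave as cosine similarities.
    """
    dim = embeddings.shape[1]
    k = max(1, int(dim * keep_ratio))  # number of dimensions to keep
    reduced = embeddings[:, :k]        # drop the trailing dimensions
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / np.clip(norms, 1e-12, None)


# Placeholder data: 4 random vectors of 1024 dimensions, not real embeddings.
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1024))

reduced = truncate_embeddings(full, keep_ratio=0.25)
similarities = reduced @ reduced.T  # cosine similarities after truncation
print(reduced.shape)
```

With real embeddings, one would compare downstream task scores computed from `full` against those computed from `reduced` to see how much performance is lost at a given `keep_ratio`.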
Taxonomy
Topics: Advanced Text Analysis Techniques · Natural Language Processing Techniques · Topic Modeling
