CSRv2: Unlocking Ultra-Sparse Embeddings
Lixuan Guo, Yifei Wang, Tiansheng Wen, Yifan Wang, Aosong Feng, Bo Chen, Stefanie Jegelka, Chenyu You

TL;DR
CSRv2 introduces a training method that enables ultra-sparse embeddings, significantly reducing inactive neurons and computational costs while maintaining high performance in text and vision tasks.
Contribution
It presents CSRv2, a novel training approach that stabilizes ultra-sparse embeddings, improves their quality, and makes them practical for real-world applications.
Findings
Reduces the share of dead (inactive) neurons from over 80% to 20%.
Achieves a 14% accuracy gain at k=2.
Provides a 7× speedup over MRL and 300× improvements in compute and memory efficiency.
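To make the dead-neuron figure concrete, here is a minimal sketch of how such a fraction could be measured. It assumes that a latent dimension counts as "dead" if top-k selection never picks it across a batch. The helper names (`top_k_indices`, `dead_neuron_fraction`) and the toy data are hypothetical and are not taken from the paper's code.

```python
import random

def top_k_indices(values, k):
    """Return the indices of the k largest entries of `values`."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]

def dead_neuron_fraction(activations, k):
    """Fraction of latent dimensions never selected by top-k over a batch.

    `activations` is a list of equal-length rows, one per input.
    This definition of "dead" is an assumption made for illustration.
    """
    n_latent = len(activations[0])
    used = set()
    for row in activations:
        used.update(top_k_indices(row, k))
    return 1.0 - len(used) / n_latent

# Toy batch: 200 inputs, 64 latent dimensions, with the first 8 dimensions
# given a large positive bias so that they dominate top-k selection.
random.seed(0)
n_latent = 64
batch = []
for _ in range(200):
    row = [random.gauss(0.0, 1.0) for _ in range(n_latent)]
    for j in range(8):
        row[j] += 5.0
    batch.append(row)

frac_k2 = dead_neuron_fraction(batch, k=2)    # few dimensions ever fire
frac_k32 = dead_neuron_fraction(batch, k=32)  # most dimensions fire at least once
print(f"dead fraction at k=2:  {frac_k2:.2f}")
print(f"dead fraction at k=32: {frac_k32:.2f}")
```

On this toy batch, a small k lets the few dominant dimensions win every selection and leaves most of the latent space unused. This mirrors the kind of collapse the findings describe, though the data here is synthetic.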
Abstract
In the era of large foundation models, the quality of embeddings has become a central determinant of downstream task performance and overall system capability. Yet widely used dense embeddings are often extremely high-dimensional, incurring substantial costs in storage, memory, and inference latency. To address these costs, Contrastive Sparse Representation (CSR) was recently proposed as a promising direction: it maps dense embeddings into high-dimensional but k-sparse vectors, in contrast to compact dense embeddings such as those from Matryoshka Representation Learning (MRL). Despite its promise, CSR suffers severe degradation in the ultra-sparse regime, where over 80% of neurons remain inactive, leaving much of its efficiency potential unrealized. In this paper, we introduce CSRv2, a principled training approach designed to make ultra-sparse embeddings viable. CSRv2 stabilizes sparsity learning through…
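The idea of a k-sparse vector in a high-dimensional space can be illustrated with a minimal sketch. It assumes a plain top-k selection over an already-computed latent vector. The actual CSR encoder is learned and is not reproduced here, and the function names `to_k_sparse` and `sparse_dot` are hypothetical.

```python
def to_k_sparse(values, k):
    """Keep the k largest entries; return (index, value) pairs sorted by index."""
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [(i, values[i]) for i in sorted(top)]

def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as (index, value) pairs."""
    b_map = dict(b)
    return sum(v * b_map[i] for i, v in a if i in b_map)

# Toy latent vectors (illustrative values, not real embeddings).
query = [0.1, 0.9, 0.0, 0.4, 0.05, 0.7]
doc = [0.2, 0.8, 0.3, 0.0, 0.6, 0.1]

q_sparse = to_k_sparse(query, k=2)  # keeps indices 1 and 5
d_sparse = to_k_sparse(doc, k=2)    # keeps indices 1 and 4
score = sparse_dot(q_sparse, d_sparse)  # only index 1 overlaps

print(q_sparse, d_sparse, round(score, 2))
```

Each vector stores only k index-value pairs regardless of the latent dimension, and the similarity computation touches at most k entries per vector. That is the source of the storage and retrieval savings the abstract refers to.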
Peer Reviews
Decision · ICLR 2026 Poster
1. The analysis of CSR's failure modes in the ultra-sparse regime, which exposes dead neurons with and without annealing, is interesting.
2. The evaluation setup is exhaustive and the accuracy improvements are compelling. The efficiency results on retrieval are also strong and indicate that the method should scale well to practical settings.
1. The novelty of the method is fairly limited. K-annealing and supervised contrastive objectives have been discussed and extensively experimented with in prior work, as the authors acknowledge.
2. The gains from annealing itself are relatively small compared to those from adding supervision and finetuning, which raises the concern of whether the improvements come from the sparse learning principle or mostly from an engineered training recipe.
3. The efficiency cost focuses on retrie
In general, improving any algorithm is great, and this work improves CSR, a fairly recent paradigm for ultrasparse representations, as well as providing improvements over the equivalent dimensional MRL representations.
I feel this work has several weaknesses, from novelty to architecture, which lead me to reject it.
1. On curriculum: Do we always want a backbone that produces a sparse representation at only one specific sparsity? I believe a more practical setup is one where you have a single backbone trained on multiple top-k values, from the largest desirable sparsity to the smallest. The current setup seems impractical, as I believe if there is a backbone t
* This paper addresses an important problem of learning sparse embeddings that are compute optimal for their memory and inference time, showing strong empirical performance on real-world embedding benchmarks like MTEB and ImageNet-1K.
* The paper is written and presented very well, and is easy to follow. The figures and tables are clean, interpretable, and provide supporting evidence to the claims in the paper.
* Design choices made by the authors are empirically validated to show the benefits of
I do not have any major concerns with this work. I will highlight several minor weaknesses that could be addressed to further improve its quality:
* I suggest rearranging the abstract content to be more specific about empirical performance on each dataset/task:
  * L22 - L26: *"delivers a 14% accuracy gain at k = 2 ... a 7× speedup over MRL... 300× improvements in compute and memory efficiency"* - gain/speedup with respect to what and on what data?
  * For L27 - 31, how much (metric value)
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
