CRISP: Clustering Multi-Vector Representations for Denoising and Pruning
João Veneroso, Rajesh Jayaram, Jinmeng Rao, Gustavo Hernández Ábrego, Majid Hadian, Daniel Cer

TL;DR
CRISP is a novel training method that learns inherently clusterable multi-vector representations, significantly reducing storage and computation in neural IR models while maintaining or improving retrieval performance.
Contribution
CRISP introduces an end-to-end training approach that embeds clustering into multi-vector models, outperforming post-hoc clustering and token pruning methods.
Findings
Achieves ~3x reduction in vectors with better performance than original models.
Attains 11x vector reduction with only 3.6% quality loss.
Effectively denoises representations by filtering irrelevant information.
Abstract
Multi-vector models, such as ColBERT, are a significant advancement in neural information retrieval (IR), delivering state-of-the-art performance by representing queries and documents by multiple contextualized token-level embeddings. However, this increased representation size introduces considerable storage and computational overheads which have hindered widespread adoption in practice. A common approach to mitigate this overhead is to cluster the model's frozen vectors, but this strategy's effectiveness is fundamentally limited by the intrinsic clusterability of these embeddings. In this work, we introduce CRISP (Clustered Representations with Intrinsic Structure Pruning), a novel multi-vector training method which learns inherently clusterable representations directly within the end-to-end training process. By integrating clustering into the training phase rather than imposing it…
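To make the post-hoc baseline described in the abstract concrete, the following is a minimal sketch of clustering a document's frozen token embeddings with k-means and scoring against the centroids instead of the original vectors. It is not the CRISP training procedure, which learns clusterable representations during training. The function names `kmeans_centroids` and `maxsim_score`, the random toy embeddings, and the choice of 120 tokens reduced to 40 centroids are all illustrative assumptions, and embedding normalization is omitted for brevity.

```python
import numpy as np


def kmeans_centroids(vectors, k, iters=10, seed=0):
    """Plain Lloyd's k-means over one document's vectors; returns k centroids.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    k = min(k, len(vectors))
    # Initialize centroids from k distinct input vectors (fancy indexing copies).
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid (squared Euclidean distance).
        dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        # Move each centroid to the mean of its members; leave it unchanged if empty.
        for j in range(k):
            members = vectors[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query vector, take the best dot product
    against any document vector, then sum over query vectors."""
    sims = query_vecs @ doc_vecs.T
    return float(sims.max(axis=1).sum())


# Toy data standing in for token-level embeddings (assumed shapes, random values).
rng = np.random.default_rng(0)
doc = rng.normal(size=(120, 32))    # 120 document token vectors
query = rng.normal(size=(8, 32))    # 8 query token vectors

doc_small = kmeans_centroids(doc, k=40)   # 3x fewer vectors to store and score
full_score = maxsim_score(query, doc)
pruned_score = maxsim_score(query, doc_small)
print(doc.shape, doc_small.shape)
print(full_score, pruned_score)
```

Because each centroid is an average of original vectors, the pruned score in this sketch can never exceed the full score. How much it drops depends on how tightly the frozen embeddings cluster, which is the limitation the abstract attributes to post-hoc clustering.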
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper tackles an important and practical problem in neural information retrieval: the efficiency and redundancy of token-level multi-vector representations.
2. The experimental section covers multiple BEIR tasks and includes comparisons with multi-vector and single-vector models, including the pruned variants.
1. The main concern with this paper lies in its lack of substantial methodological innovation. While CRISP presents a training paradigm that integrates clustering directly into the end-to-end learning process of multi-vector retrieval models, this idea is not fundamentally new. Prior works, such as “Deep Clustering for Unsupervised Learning of Visual Features” and subsequent extensions, have already explored integrating k-means-style clustering objectives into representation learning frameworks.
The method proposed is very simple (and hence likely to be picked up and generate impact) and quite effective, as shown by the results on BEIR. The authors consider a number of reasonable baselines and train all models in a fair manner. This could enable a fair and scientific comparison, though it has the disadvantage of being somewhat disconnected from the broader literature on training techniques for many of these models.
The paper offers limited insight about the cost or complexity of running k-means clustering during training. How expensive or complex is this? Were any special tricks necessary for dealing with the gradient propagation? Why wasn't it done before, at least so effectively? Are there any theoretical or conceptual concerns that should be considered in clustering _per document_ versus clustering across the corpus? BEIR is a fairly old and "easy" / rather saturated benchmark at this point. What about
- The paper presents the idea of integrating clustering into the training process, enabling the model to learn representations that are inherently more suitable for clustering.
- Overall, the paper is clearly written and easy to follow.
- Insufficient baseline comparison: The authors only compare their method with other clustering-based approaches and very simple fixed-token pruning methods. However, recent research has actively explored dynamic token and representation compression techniques that enhance efficiency (e.g., [1]). Such methods may in fact address the three key challenges listed in lines 125–139 more directly, by compressing redundancy while minimising information loss, thereby improving computational efficiency.
Taxonomy
Topics: Information Retrieval and Search Behavior · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
Methods: Pruning
