Incorporating Token Importance in Multi-Vector Retrieval
Archish S, Ankit Garg, Kirankumar Shiragur, Neeraj Kayal

TL;DR
This paper enhances the ColBERT multi-vector retrieval model by incorporating token importance weights into the distance computation, improving retrieval performance on benchmark datasets.
Contribution
It introduces a weighted sum extension to the Chamfer distance in ColBERT, leveraging token importance to boost retrieval accuracy without retraining document representations.
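As a rough illustration of the contribution, the unweighted and weighted scores can be written side by side. The notation here is an assumption, not taken from the paper: $q_1, \dots, q_n$ are the query token vectors, $d_1, \dots, d_m$ the document token vectors, and $w_i \ge 0$ the importance weight of query token $i$. The sketch uses the similarity (maximum inner product) form of late interaction; for normalized vectors, picking the most similar document token is the same as picking the closest one.

```latex
% Unweighted late-interaction score: each query token contributes equally.
\[
\mathrm{score}(Q, D) = \sum_{i=1}^{n} \max_{1 \le j \le m} \langle q_i, d_j \rangle
\]

% Weighted variant: each query token's contribution is scaled by w_i.
\[
\mathrm{score}_w(Q, D) = \sum_{i=1}^{n} w_i \max_{1 \le j \le m} \langle q_i, d_j \rangle
\]
```

Setting every $w_i = 1$ recovers the unweighted score, which is why the extension leaves the precomputed document vectors untouched.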
Findings
IDF-based token weights improve Recall@10 by 1.28% in the zero-shot setting.
Few-shot fine-tuning of the token weights raises Recall@10 by 3.66%.
The method stays efficient because the multi-vector representations are kept fixed during training; only the token weights are learned.
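The zero-shot finding relies on IDF weights, which can be computed from corpus statistics alone. The summary does not say which IDF variant the authors use, so the following is only a minimal sketch assuming the textbook form log(N / df); the function name `idf_weights` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Return a {token: idf} map using the plain log(N / df) formula.

    Assumption: this is the textbook IDF, not necessarily the variant
    used in the paper.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        # Count each token at most once per document.
        doc_freq.update(set(doc))
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}


# Toy corpus: "a" appears in every document, "b" and "c" in one each.
corpus = [["a", "b"], ["a", "c"], ["a"]]
weights = idf_weights(corpus)
```

In this toy corpus the token that appears everywhere gets weight zero, while rarer tokens get larger weights, which matches the intuition that rare query tokens should count for more.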
Abstract
ColBERT introduced a late interaction mechanism that independently encodes queries and documents using BERT, and computes similarity via fine-grained interactions over token-level vector representations. This design enables expressive matching while allowing efficient computation of scores, as the multi-vector document representations can be precomputed offline. ColBERT models distance using a Chamfer-style function: for each query token, it selects the closest document token and sums these distances across all query tokens. In our work, we explore enhancements to the Chamfer distance function by computing a weighted sum over query token contributions, where weights reflect the token importance. Empirically, we show that this simple extension, requiring only token-weight training while keeping the multi-vector representations fixed, further enhances the expressiveness of late…
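The scoring rule described in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: it assumes similarity is measured by inner product over token vectors (so "closest" means "highest inner product"), and the function name `weighted_chamfer_score` and the toy vectors are hypothetical.

```python
import numpy as np


def weighted_chamfer_score(query_vecs, doc_vecs, weights=None):
    """Weighted late-interaction score between one query and one document.

    query_vecs: array of shape (n_query_tokens, dim)
    doc_vecs:   array of shape (n_doc_tokens, dim)
    weights:    optional array of shape (n_query_tokens,); defaults to all
                ones, which gives the unweighted sum.
    """
    # Pairwise inner products between every query and document token.
    sims = query_vecs @ doc_vecs.T
    # For each query token, keep its best-matching document token.
    best = sims.max(axis=1)
    if weights is None:
        weights = np.ones_like(best)
    # Weighted sum over query-token contributions.
    return float(np.dot(weights, best))


# Toy example with two query tokens and two document tokens.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc = np.array([[1.0, 0.0], [0.6, 0.8]])

unweighted = weighted_chamfer_score(query, doc)
weighted = weighted_chamfer_score(query, doc, weights=np.array([2.0, 0.5]))
```

Because only `weights` changes between the two calls, the document vectors can stay precomputed; that is the property the abstract highlights.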
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Advanced Image and Video Retrieval Techniques
