Spike Hijacking in Late-Interaction Retrieval
Karthik Suresh, Tushar Vatsa, Tracy King, Asim Kadav, Michael Friedrich

TL;DR
This paper analyzes how MaxSim pooling in late-interaction retrieval models causes gradient concentration and sensitivity to document length, revealing a tradeoff between discrimination and robustness.
Contribution
It provides a mechanistic study of gradient routing in MaxSim, documents the structural biases this pooling rule introduces into training, and argues for principled pooling alternatives.
Findings
MaxSim induces higher patch-level gradient concentration than smoother alternatives such as Top-k pooling and softmax aggregation.
Sparse routing improves early discrimination but increases sensitivity to document length.
As the number of document patches grows, MaxSim degrades more sharply than mild smoothing variants.
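To make "gradient concentration" concrete, the sketch below computes how each pooling rule would distribute the gradient of one query token's pooled score over document patches. It is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, softmax aggregation is assumed to be a temperature-scaled log-sum-exp (whose gradient is a softmax over the similarities), and concentration is measured as the sum of squared routing weights, which may differ from the metric the authors use.

```python
import math

def maxsim_weights(sims):
    # Hard max: the whole gradient flows to the single best-matching patch.
    top = max(range(len(sims)), key=lambda j: sims[j])
    return [1.0 if j == top else 0.0 for j in range(len(sims))]

def topk_weights(sims, k):
    # Top-k mean (assumes k <= len(sims)): gradient is split evenly over k patches.
    order = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)[:k]
    chosen = set(order)
    return [1.0 / k if j in chosen else 0.0 for j in range(len(sims))]

def softmax_weights(sims, temperature):
    # Assumed form: pooled score = temperature * log(sum(exp(s / temperature))).
    # Its gradient with respect to each similarity is softmax(s / temperature).
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]

def concentration(weights):
    # Assumed metric: sum of squared weights. 1.0 means one patch gets everything;
    # 1 / n means the gradient is spread uniformly over n patches.
    return sum(w * w for w in weights)

# Made-up similarities between one query token and four document patches.
sims = [0.9, 0.7, 0.2, 0.1]
for name, w in [
    ("maxsim", maxsim_weights(sims)),
    ("top-2", topk_weights(sims, 2)),
    ("softmax", softmax_weights(sims, temperature=0.1)),
]:
    print(name, [round(x, 3) for x in w], round(concentration(w), 3))
```

Under this metric MaxSim always scores 1.0, Top-k scores 1/k, and the softmax variant falls between uniform and one-hot depending on the temperature.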
Abstract
Late-interaction retrieval models rely on hard maximum similarity (MaxSim) to aggregate token-level similarities. Although effective, this winner-take-all pooling rule may structurally bias training dynamics. We provide a mechanistic study of gradient routing and robustness in MaxSim-based retrieval. In a controlled synthetic environment with in-batch contrastive training, we demonstrate that MaxSim induces significantly higher patch-level gradient concentration than smoother alternatives such as Top-k pooling and softmax aggregation. While sparse routing can improve early discrimination, it also increases sensitivity to document length: as the number of document patches grows, MaxSim degrades more sharply than mild smoothing variants. We corroborate these findings on a real-world multi-vector retrieval benchmark, where controlled document-length sweeps reveal similar brittleness under…
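One plausible mechanism behind the length sensitivity described above can be sketched with a toy length sweep. This is not the paper's synthetic environment or benchmark: the helper names, the Gaussian noise model with standard deviation 0.1, and the choice k = 2 are all assumptions made for illustration. The sketch scores an irrelevant document whose patches are pure noise and lets the document grow by appending patches.

```python
import random

def maxsim_score(sims):
    # Hard max over patch similarities for one query token.
    return max(sims)

def topk_mean_score(sims, k):
    # Mean of the k largest similarities (assumes k <= len(sims)).
    return sum(sorted(sims, reverse=True)[:k]) / k

def length_sweep(lengths, seed=0):
    # Draw one noise sequence and take nested prefixes, so a longer
    # document is the shorter one plus extra distractor patches.
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 0.1) for _ in range(max(lengths))]
    rows = []
    for n in lengths:
        patches = noise[:n]
        rows.append((n, maxsim_score(patches), topk_mean_score(patches, k=2)))
    return rows

rows = length_sweep([4, 16, 64, 256])
for n, hard, smooth in rows:
    print(n, round(hard, 3), round(smooth, 3))
```

Because appending patches can never lower a maximum, the irrelevant document's MaxSim score can only stay flat or rise as it gets longer, which eats into its margin against a relevant document. The Top-k mean is also non-decreasing here, but it is bounded above by the hard max. How much each rule degrades in practice is an empirical question that this toy does not settle.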
