FastLane: Efficient Routed Systems for Late-Interaction Retrieval
Ramnath Kumar, Prateek Jain, Cho-Jui Hsieh

TL;DR
FastLane is a new retrieval framework that cuts the computational cost of late-interaction models (by up to 30x) by dynamically routing each query to its most informative representations, enabling scalable, low-latency retrieval for large-scale applications.
Contribution
It introduces a learnable routing mechanism that optimizes token-level interactions, bridging late-interaction models with Approximate Nearest Neighbor Search for improved efficiency.
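The routing idea can be illustrated with a minimal sketch. It is not FastLane's implementation. The projections `w_q` and `w_k`, the "importance = total attention received" heuristic and the function name `route_query` are all hypothetical. Scaled dot-product self-attention scores the query tokens, and a temperature-controlled softmax turns those scores into selection weights. The soft path is a smooth function of the learnable projections, so in an autodiff framework it could in principle be trained jointly with the encoder. The `hard=True` path keeps a single token, as one might do at inference time. Pure Python is used only to show the forward computation.

```python
import math


def dot(a, b):
    # Inner product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))


def matvec(m, v):
    # Apply matrix m (list of rows) to vector v.
    return [dot(row, v) for row in m]


def softmax(xs, temperature=1.0):
    # Numerically stable softmax; a lower temperature gives sharper weights.
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_query(tokens, w_q, w_k, temperature=1.0, hard=False):
    """Hypothetical router, not FastLane's published code.

    tokens: query token embeddings, each a list of length dim
    w_q, w_k: hypothetical learnable dim x dim projections
    Returns (routed_vector, selection_weights).
    """
    dim = len(tokens[0])
    qs = [matvec(w_q, t) for t in tokens]
    ks = [matvec(w_k, t) for t in tokens]
    # Row i holds how strongly token i attends to each token j.
    attn = [softmax([dot(qi, kj) / math.sqrt(dim) for kj in ks]) for qi in qs]
    # Assumed importance score: total attention a token receives.
    importance = [sum(row[j] for row in attn) for j in range(len(tokens))]
    # Soft (differentiable) selection weights over tokens.
    weights = softmax(importance, temperature)
    if hard:
        # Non-differentiable hard choice: keep only the top-weighted token.
        best = max(range(len(weights)), key=lambda i: weights[i])
        weights = [1.0 if i == best else 0.0 for i in range(len(weights))]
    # Routed representation = weighted combination of the token embeddings.
    routed = [sum(w * t[c] for w, t in zip(weights, tokens)) for c in range(dim)]
    return routed, weights


# Toy usage with identity projections.
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 1.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
soft_vec, soft_w = route_query(tokens, identity, identity, temperature=0.5)
hard_vec, hard_w = route_query(tokens, identity, identity, hard=True)
```

The soft weights always sum to 1, and the hard path returns exactly one of the original token embeddings. Other differentiable selection operators (for example Gumbel-softmax or straight-through estimators) could replace the plain softmax. Which one FastLane uses is not stated here.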
Findings
Reduces computational complexity by up to 30x.
Maintains competitive retrieval performance.
Enables scalable, low-latency retrieval for large-scale systems.
Abstract
Late-interaction retrieval models like ColBERT achieve superior accuracy by enabling token-level interactions, but their computational cost hinders scalability and integration with Approximate Nearest Neighbor Search (ANNS). We introduce FastLane, a novel retrieval framework that dynamically routes queries to their most informative representations, eliminating redundant token comparisons. FastLane employs a learnable routing mechanism optimized alongside the embedding model, leveraging self-attention and differentiable selection to maximize efficiency. Our approach reduces computational complexity by up to 30x while maintaining competitive retrieval performance. By bridging late-interaction models with ANNS, FastLane enables scalable, low-latency retrieval, making it feasible for large-scale applications such as search engines, recommendation systems, and question-answering platforms.…
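To see where the savings come from, the sketch below compares ColBERT-style MaxSim scoring, which compares every query token with every document token, against a hypothetical routed variant that scores only a chosen subset of query tokens. The `selected` indices stand in for whatever a learned router outputs; how FastLane actually picks them is an assumption left open here. The number of dot products drops from len(query) × len(doc) to k × len(doc) for k selected tokens.

```python
def dot(a, b):
    # Inner product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens, doc_tokens):
    """ColBERT-style late interaction: each query token is matched to its
    best document token and the maxima are summed.
    Returns (score, number_of_dot_products)."""
    score = 0.0
    for q in query_tokens:
        score += max(dot(q, d) for d in doc_tokens)
    return score, len(query_tokens) * len(doc_tokens)


def routed_score(query_tokens, doc_tokens, selected):
    """Hypothetical routed variant: only query tokens whose indices appear in
    `selected` (standing in for a learned router's output) are compared."""
    chosen = [query_tokens[i] for i in selected]
    return maxsim(chosen, doc_tokens)


query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
doc = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
full = maxsim(query, doc)                 # (6.0, 12): 4 x 3 comparisons
routed = routed_score(query, doc, [2])    # (2.0, 3): 1 x 3 comparisons
```

With a single routed vector per query (k = 1), scoring reduces to finding the document token with the highest inner product, which is a standard maximum inner product search that ANNS indexes are designed for. The "up to 30x" figure comes from the paper; this toy example does not reproduce it.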
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Multimodal Machine Learning Applications
