LightRetriever: A LLM-based Text Retrieval Architecture with Extremely Faster Query Inference
Guangyuan Ma, Yongliang Ma, Xuanrui Gou, Zhenpeng Su, Ming Zhou, Songlin Hu

TL;DR
LightRetriever introduces a lightweight LLM-based text retrieval system that drastically speeds up query inference by replacing full query encoding with simple embedding lookups, while maintaining high retrieval accuracy.
Contribution
The paper proposes an asymmetric design that keeps a full-sized LLM for document encoding but reduces query encoding to an embedding lookup, making query inference much faster while retaining about 95% of retrieval performance.
Findings
Over 1000x speedup in query encoding compared to full LLMs
More than 10x increase in end-to-end retrieval throughput
Maintains 95% of retrieval performance across benchmarks
Abstract
Large Language Models (LLMs)-based text retrieval retrieves documents relevant to search queries based on vector similarities. Documents are pre-encoded offline, while queries arrive in real-time, necessitating an efficient online query encoder. Although LLMs significantly enhance retrieval capabilities, serving deeply parameterized LLMs slows down query inference throughput and increases demands for online deployment resources. In this paper, we propose LightRetriever, a novel LLM-based retriever with extremely lightweight query encoders. Our method retains a full-sized LLM for document encoding, but reduces the workload of query encoding to no more than an embedding lookup. Compared to serving a full LLM on an A800 GPU, our method achieves over 1000x speedup in query encoding and over 10x increase in end-to-end retrieval throughput. Extensive experiments on large-scale retrieval…
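To make the "no more than an embedding lookup" idea concrete, here is a minimal sketch of lookup-based dense query encoding. It is an illustration under assumptions, not the authors' implementation: the token table is filled with random vectors (in the paper it would be produced offline by the trained LLM), tokenization is a plain word list, similarity is a dot product, and the names `build_token_table`, `encode_query`, and `rank` are hypothetical.

```python
import random


def build_token_table(vocab, dim, seed=0):
    """Hypothetical stand-in for the offline token cache.

    In LightRetriever the per-token vectors would come from the trained
    LLM; random Gaussian vectors are used here purely for illustration.
    """
    rng = random.Random(seed)
    return {tok: [rng.gauss(0.0, 1.0) for _ in range(dim)] for tok in vocab}


def encode_query(tokens, table):
    """Dense query vector: mean of the cached vectors of the query tokens."""
    vecs = [table[t] for t in tokens if t in table]  # lookup only, no LLM call
    if not vecs:
        raise ValueError("query contains no token present in the table")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def dot(a, b):
    """Dot-product similarity (an assumed choice of vector similarity)."""
    return sum(x * y for x, y in zip(a, b))


def rank(query_vec, doc_vecs):
    """Document ids sorted by descending dot-product similarity."""
    return sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)


vocab = ["light", "weight", "query", "encoder", "retrieval"]
table = build_token_table(vocab, dim=8)
q = encode_query(["query", "encoder"], table)

# Toy pre-encoded documents: one aligned with the query, one opposed to it.
docs = {"aligned": list(q), "opposed": [-x for x in q]}
ranking = rank(q, docs)
```

The online cost here is a handful of table reads and one average per query, which is the intuition behind the reported query-encoding speedup; all the expensive modeling is pushed into building the table and encoding documents offline.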
Peer Reviews
Decision: ICLR 2026 Poster
- The paper's primary strength is its direct and effective solution to a practical, real-world bottleneck: the high cost and low throughput of online query encoding in LLM-based retrieval systems. The method reduces query encoding time by orders of magnitude (e.g., ~109s to 0.04s for an 8B model on a test batch), leading to a >10x increase in overall query-per-second (QPS) throughput.
- The training methodology is simple and effective.
- The paper's claims are well-supported by its ablations. A
- Evaluations on more challenging IR tasks representing live, real-world traffic (maybe include CoIR, BRIGHT).
- Please add a discussion section (in the Appendix if needed) that discusses performance within BeIR splits. For example, in the case of HotpotQA (multi-hop), FiQA, etc., the performance drops are significant compared to full baselines. This aspect should be made clearer in the paper write-up (expanded more than L407 - L409). Currently, the paper reads as this approach being a one-solution
(1) The paper introduces a clear and practical asymmetric design that eliminates deep query-side inference while preserving full LLM power on the document side, delivering extreme online speedups with modest accuracy trade-offs. (2) The dense pathway’s cache-and-average mechanism and the sparse pathway’s LM-to-vocabulary projection with FLOPs-based sparsity are well formalized, and the training–caching–serving pipeline is technically sound and reproducible. (3) Empirical coverage is broad, spa
(1) The approach still depends on full LLM query modeling during training; the paper does not explore reducing training-time cost through distillation, curricula, or lighter interim encoders while maintaining cached-token quality. (2) Evaluation focuses on academic text benchmarks; robustness on production-like settings with noisy queries, long or heterogeneous documents, domain shifts, and adversarial inputs is not addressed, limiting external validity. (3) Instruction conditioning is asymmet
1. This work addresses a critical practical bottleneck in deploying LLM-based retrievers: high online query latency, while achieving substantial throughput gains. 2. The "distill-to-embedding-bag" approach is novel. It ingeniously caches the word-level understanding of the instruction-aware $Enc_q$ into a simple lookup table, differing from standard knowledge distillation schemes. 3. The method is validated across multiple LLMs (Llama, Qwen) and benchmark datasets (BeIR, CMTEB-R), with fair co
1. The dense query encoder employs a bag-of-words model, where $v_q^{den} = \frac{1}{n}\sum E[t_i]$. This model fails to capture query composability, meaning that for the model, the query vectors for "flights from Beijing to Shanghai" and "flights from Shanghai to Beijing" are identical. This fundamentally limits the model's ability to understand complex queries. 2. This approach trades significant online memory consumption (an 8B model requires approximately 1.05GB of embedding tables) for re
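The two points in the last review above can be checked with a short sketch. The first part applies the quoted formula $v_q^{den} = \frac{1}{n}\sum E[t_i]$ to the reviewer's two example queries with a made-up 2-dimensional table `E` and shows that the resulting vectors coincide. The second part is a back-of-the-envelope size estimate; the vocabulary size, hidden dimension, and 16-bit storage are assumptions (Llama-3.1-8B-like shapes) that are not stated in the source, chosen because they reproduce the quoted figure of roughly 1.05GB.

```python
import math


def mean_pool(tokens, table):
    """v_q = (1/n) * sum_i E[t_i], computed by lookup and averaging."""
    dim = len(next(iter(table.values())))
    acc = [0.0] * dim
    for tok in tokens:
        for i, x in enumerate(table[tok]):
            acc[i] += x
    return [a / len(tokens) for a in acc]


# Toy 2-dimensional embedding table (values are made up for illustration).
E = {
    "flights": [1.0, 0.0],
    "from": [0.0, 1.0],
    "beijing": [2.0, 2.0],
    "to": [0.5, 0.5],
    "shanghai": [-1.0, 3.0],
}

q1 = mean_pool("flights from beijing to shanghai".split(), E)
q2 = mean_pool("flights from shanghai to beijing".split(), E)
# Averaging is permutation-invariant, so word order is lost.
same = all(math.isclose(a, b) for a, b in zip(q1, q2))

# Table size under assumed shapes (not taken from the paper).
vocab_size = 128_256     # assumed vocabulary size
hidden_dim = 4_096       # assumed embedding width
bytes_per_value = 2      # assumed 16-bit storage
table_bytes = vocab_size * hidden_dim * bytes_per_value
table_gb = table_bytes / 1e9  # decimal gigabytes
```

Under these assumptions `same` is true and `table_gb` comes out at about 1.05, so the memory figure is consistent with one vector of model width per vocabulary token. Whether order-insensitivity matters in practice depends on the query distribution, which the sketch does not address.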
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies
