LightRetriever: A LLM-based Text Retrieval Architecture with Extremely Faster Query Inference

Guangyuan Ma; Yongliang Ma; Xuanrui Gou; Zhenpeng Su; Ming Zhou; Songlin Hu

arXiv:2505.12260 · cs.IR · February 2, 2026

TL;DR

LightRetriever introduces a lightweight LLM-based text retrieval system that drastically speeds up query inference by replacing full query encoding with simple embedding lookups, while maintaining high retrieval accuracy.

Contribution

It proposes an asymmetric design that keeps a full-sized LLM for document encoding while reducing query encoding to an embedding lookup, enabling much faster retrieval while retaining about 95% of full-model retrieval performance.

Findings

01

Over 1000x speedup in query encoding compared to full LLMs

02

More than 10x increase in end-to-end retrieval throughput

03

Maintains 95% of retrieval performance across benchmarks

Abstract

Large Language Models (LLMs)-based text retrieval retrieves documents relevant to search queries based on vector similarities. Documents are pre-encoded offline, while queries arrive in real-time, necessitating an efficient online query encoder. Although LLMs significantly enhance retrieval capabilities, serving deeply parameterized LLMs slows down query inference throughput and increases demands for online deployment resources. In this paper, we propose LightRetriever, a novel LLM-based retriever with extremely lightweight query encoders. Our method retains a full-sized LLM for document encoding, but reduces the workload of query encoding to no more than an embedding lookup. Compared to serving a full LLM on an A800 GPU, our method achieves over 1000x speedup in query encoding and over 10x increase in end-to-end retrieval throughput. Extensive experiments on large-scale retrieval…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- The paper's primary strength is its direct and effective solution to a practical, real-world bottleneck: the high cost and low throughput of online query encoding in LLM-based retrieval systems. The method reduces query encoding time by orders of magnitude (e.g., ~109s to 0.04s for an 8B model on a test batch), leading to a >10x increase in overall query-per-second (QPS) throughput.
- The training methodology is simple and effective.
- The paper's claims are well-supported by its ablations. A

Weaknesses

- Evaluations on more challenging IR tasks representing live, real-world traffic are missing (maybe include CoIR, BRIGHT).
- Please add a discussion section (in the Appendix if needed) that talks about performance within BeIR splits. For example, in the case of HotpotQA (multi-hop), FiQA, etc., the performance drops are significant compared to full baselines. This aspect should be made clearer in the paper write-up (expanded more than L407 - L409). Currently, the paper reads as this approach being a one-solution

Reviewer 02 · Rating 4 · Confidence 2

Strengths

(1) The paper introduces a clear and practical asymmetric design that eliminates deep query-side inference while preserving full LLM power on the document side, delivering extreme online speedups with modest accuracy trade-offs.
(2) The dense pathway’s cache-and-average mechanism and the sparse pathway’s LM-to-vocabulary projection with FLOPs-based sparsity are well formalized, and the training–caching–serving pipeline is technically sound and reproducible.
(3) Empirical coverage is broad, spa
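To make the "FLOPs-based sparsity" mentioned in point (2) more concrete, here is a minimal sketch. It assumes the phrase refers to the FLOPS regularizer commonly used in learned sparse retrieval, which sums, over vocabulary terms, the squared mean absolute activation across a batch. The paper's exact loss is not shown on this page, so the function name `flops_penalty` and the toy batches are hypothetical illustrations only.

```python
from typing import List

def flops_penalty(weights: List[List[float]]) -> float:
    """FLOPS-style penalty: sum over vocab terms of (batch-mean |weight|) squared.

    `weights` is a batch of sparse vectors, one row per text and one column
    per vocabulary term. This is an assumed form, not the paper's code.
    """
    batch = len(weights)
    vocab = len(weights[0])
    penalty = 0.0
    for j in range(vocab):
        mean_abs = sum(abs(row[j]) for row in weights) / batch
        penalty += mean_abs ** 2
    return penalty

# Toy batches (made-up values): every term active vs. one term active per row.
dense_batch = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
sparse_batch = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

print(flops_penalty(dense_batch))   # 4.0
print(flops_penalty(sparse_batch))  # 0.5
```

Under this assumed form, zeroing out term weights lowers the penalty, which is what pushes the vocabulary-sized vectors toward sparsity during training.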

Weaknesses

(1) The approach still depends on full LLM query modeling during training; the paper does not explore reducing training-time cost through distillation, curricula, or lighter interim encoders while maintaining cached-token quality.
(2) Evaluation focuses on academic text benchmarks; robustness on production-like settings with noisy queries, long or heterogeneous documents, domain shifts, and adversarial inputs is not addressed, limiting external validity.
(3) Instruction conditioning is asymmet

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. This work addresses a critical practical bottleneck in deploying LLM-based retrievers: high online query latency, while achieving substantial throughput gains.
2. The "distill-to-embedding-bag" approach is novel. It ingeniously caches the word-level understanding of the instruction-aware $Enc_q$ into a simple lookup table, differing from standard knowledge distillation schemes.
3. The method is validated across multiple LLMs (Llama, Qwen) and benchmark datasets (BeIR, CMTEB-R), with fair co

Weaknesses

1. The dense query encoder employs a bag-of-words model, where $v_q^{den} = \frac{1}{n}\sum_{i=1}^{n} E[t_i]$. This model fails to capture query composability, meaning that for the model, the query vectors for "flights from Beijing to Shanghai" and "flights from Shanghai to Beijing" are identical. This fundamentally limits the model's ability to understand complex queries.
2. This approach trades significant online memory consumption (an 8B model requires approximately 1.05GB of embedding tables) for re
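The two points above can be illustrated with a small sketch. It mean-pools cached token vectors as in the bag-of-words formula from the first weakness and shows that the two example queries map to the same vector. The table `EMBED`, its values, and the whitespace tokenization are made up for illustration. The memory estimate assumes a Llama-3-style configuration (vocabulary of 128,256 tokens, hidden size 4,096, 2 bytes per value); these numbers are not stated in the review and are only one configuration that lands near the quoted 1.05GB.

```python
from typing import Dict, List

# Hypothetical toy embedding table: token -> cached vector (values made up).
EMBED: Dict[str, List[float]] = {
    "flights": [1.0, 0.0, 2.0],
    "from": [0.0, 1.0, 0.0],
    "to": [0.0, 0.0, 1.0],
    "Beijing": [3.0, 1.0, 0.0],
    "Shanghai": [0.0, 2.0, 4.0],
}

def encode_query(tokens: List[str]) -> List[float]:
    """Mean of cached token vectors: v = (1/n) * sum_i E[t_i]."""
    n = len(tokens)
    dim = len(next(iter(EMBED.values())))
    total = [0.0] * dim
    for tok in tokens:
        for k, x in enumerate(EMBED[tok]):
            total[k] += x
    return [s / n for s in total]

# Same tokens in a different order give the same averaged vector.
q1 = encode_query("flights from Beijing to Shanghai".split())
q2 = encode_query("flights from Shanghai to Beijing".split())
print(q1 == q2)  # True

# Assumed sizes (not from the review): vocab 128,256 x hidden 4,096 x 2 bytes.
table_bytes = 128_256 * 4_096 * 2
table_gb = table_bytes / 1e9
print(round(table_gb, 2))  # 1.05
```

Because averaging discards token order, any permutation of the same tokens yields the same query vector, which is the composability limitation the reviewer describes.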

Code & Models

Repositories

caskcsg/lightretriever
pytorch · Official

Datasets

lightretriever/lightretriever-finetune-data
dataset · 1.6k downloads

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies