Making Large Language Models Efficient Dense Retrievers
Yibin Lei, Shwai He, Ang Li, Andrew Yates

TL;DR
This paper analyzes layer redundancy in LLM-based dense retrievers and introduces EffiR, a framework that compresses MLP layers to make large language models more efficient for retrieval tasks without losing performance.
Contribution
The paper provides the first comprehensive analysis of layer redundancy in LLM-based dense retrievers and proposes EffiR, a novel MLP compression method tailored for retrieval tasks.
Findings
MLP layers are highly prunable in retrieval models.
Attention layers are critical for semantic aggregation.
EffiR reduces model size and inference cost significantly while maintaining performance.
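To get a rough sense of why MLP pruning can shrink a model so much, the sketch below counts the weights in one decoder layer and estimates the share removed when some MLP sublayers are dropped. The layer shape (hidden size 4096, intermediate size 14336, 32 query heads, 8 key/value heads, 32 layers, gated three-matrix MLP) is a hypothetical configuration chosen for illustration, not one taken from the paper. Biases, normalization weights, and embeddings are ignored, and the function names are illustrative.

```python
def layer_param_counts(hidden, intermediate, n_heads, n_kv_heads):
    """Return (attention_params, mlp_params) for one decoder layer.

    Assumes grouped-query attention and a gated MLP with three weight
    matrices; biases and normalization weights are ignored.
    """
    head_dim = hidden // n_heads
    # Q and O projections are hidden x hidden; K and V are hidden x (n_kv_heads * head_dim).
    attn = 2 * hidden * hidden + 2 * hidden * n_kv_heads * head_dim
    # Gate, up, and down projections are each hidden x intermediate.
    mlp = 3 * hidden * intermediate
    return attn, mlp


def fraction_removed(n_layers, n_mlp_dropped, attn, mlp):
    """Share of all layer weights removed by dropping n_mlp_dropped MLP sublayers."""
    total = n_layers * (attn + mlp)
    return n_mlp_dropped * mlp / total


# Hypothetical layer shape (an assumption, not from the paper).
attn, mlp = layer_param_counts(hidden=4096, intermediate=14336, n_heads=32, n_kv_heads=8)
print(f"MLP share of one layer: {mlp / (attn + mlp):.1%}")
print(f"Removed by dropping 16 of 32 MLPs: {fraction_removed(32, 16, attn, mlp):.1%}")
```

Under these assumptions the MLP holds roughly four fifths of each layer's weights, so dropping half of the MLP sublayers removes about 40% of the layer parameters while leaving every attention sublayer in place.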
Abstract
Recent work has shown that directly fine-tuning large language models (LLMs) for dense retrieval yields strong performance, but their substantial parameter counts make them computationally inefficient. While prior studies have revealed significant layer redundancy in LLMs for generative tasks, it remains unclear whether similar redundancy exists when these models are adapted for retrieval tasks, which require encoding entire sequences into fixed representations rather than generating tokens iteratively. To this end, we conduct a comprehensive analysis of layer redundancy in LLM-based dense retrievers. We find that, in contrast to generative settings, MLP layers are substantially more prunable, while attention layers remain critical for semantic aggregation. Building on this insight, we propose EffiR, a framework for developing efficient retrievers that performs large-scale MLP…
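One way to picture the distinction drawn in the abstract is a toy encoder in which each layer applies an attention sublayer and then an MLP sublayer, both wrapped in residual connections. The sketch below is a simplified illustration and not EffiR itself: `toy_attention` (a uniform average over tokens), `toy_mlp` (a per-token ReLU), and last-token pooling are all stand-ins assumed for this example. It shows that a dropped MLP sublayer simply lets the residual stream pass through unchanged, and that only the attention step mixes information across tokens.

```python
def residual(sublayer, xs):
    """Apply a sublayer and add its output back to the input (residual connection)."""
    out = sublayer(xs)
    return [[a + b for a, b in zip(x, o)] for x, o in zip(xs, out)]


def toy_attention(xs):
    """Stand-in for attention: every token receives the mean over all tokens."""
    n = len(xs)
    mean = [sum(col) / n for col in zip(*xs)]
    return [list(mean) for _ in xs]


def toy_mlp(xs):
    """Stand-in for the MLP: a position-wise ReLU that never looks at other tokens."""
    return [[max(0.0, v) for v in x] for x in xs]


def encode(xs, n_layers, dropped_mlp_layers=frozenset()):
    """Encode a token sequence into one fixed-size vector.

    Layers listed in dropped_mlp_layers skip their MLP sublayer, so the
    residual stream passes through that position unchanged.
    """
    for i in range(n_layers):
        xs = residual(toy_attention, xs)
        if i not in dropped_mlp_layers:
            xs = residual(toy_mlp, xs)
    # Last-token pooling is an assumption made for this sketch.
    return xs[-1]


tokens = [[1.0, -2.0], [3.0, 4.0]]
print(encode(tokens, n_layers=1))                          # full layer
print(encode(tokens, n_layers=1, dropped_mlp_layers={0}))  # MLP sublayer dropped
```

Both calls return a vector of the same size regardless of sequence length, which is the fixed-representation setting the abstract contrasts with token-by-token generation.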
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
