Large Language Models Are Overparameterized Text Encoders
Thennal D K, Tim Fischer, Chris Biemann

TL;DR
This paper demonstrates that large language models can be effectively pruned to reduce size and inference time with minimal performance loss, revealing their overparameterization for text embedding tasks.
Contribution
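The authors introduce a simple layer-pruning method (dropping an LLM's last layers before supervised contrastive fine-tuning) and L^3 Prune, a layer-pruning strategy based on the model's initial loss. Together they show large parameter reductions with negligible or modest performance impact.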
Findings
Pruning up to 30% of layers causes negligible performance loss.
Pruning 74% of parameters results in only a 5.1-point performance decrease.
The method is easy to implement with just three lines of code.
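The "three lines of code" claim can be illustrated with a minimal sketch. It assumes a model object whose decoder blocks live in a sliceable `layers` attribute and whose config records `num_hidden_layers`. The helper name `prune_last_layers` is hypothetical and not taken from the paper. Following the abstract, it keeps the bottom layers and drops the last `fraction` of them. In Hugging Face `transformers`, Llama-style base models hold their decoder blocks in `model.layers` (an `nn.ModuleList`, which supports slicing), but the attribute path varies by architecture and is an assumption here.

```python
from types import SimpleNamespace


def prune_last_layers(model, fraction):
    """Drop the last `fraction` of decoder layers in place; return how many remain.

    Hypothetical helper: assumes `model.layers` is a sliceable sequence of
    decoder blocks and `model.config.num_hidden_layers` mirrors its length.
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    total = len(model.layers)
    keep = total - int(total * fraction)   # round the pruned count down
    model.layers = model.layers[:keep]      # keep the first `keep` layers
    model.config.num_hidden_layers = keep   # keep the config in sync
    return keep


# Toy stand-in for a 32-layer decoder; real layers would be nn.Module objects.
toy = SimpleNamespace(layers=list(range(32)),
                      config=SimpleNamespace(num_hidden_layers=32))
print(prune_last_layers(toy, 0.3))  # 32 - int(9.6) = 23 layers kept
```

Rounding the pruned count down errs toward keeping a layer rather than removing one. The paper's exact rounding rule is not stated in this summary.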
Abstract
Large language models (LLMs) demonstrate strong performance as text embedding models when finetuned with supervised contrastive training. However, their large size balloons inference time and memory requirements. In this paper, we show that by pruning the last layers of an LLM before supervised training for only 1000 steps, we can achieve a proportional reduction in memory and inference time. We evaluate four different state-of-the-art LLMs on text embedding tasks and find that our method can prune up to 30% of layers with negligible impact on performance and up to 80% with only a modest drop. With only three lines of code, our method is easily implemented in any pipeline for transforming LLMs to text encoders. We also propose L^3 Prune, a novel layer-pruning strategy based on the model's initial loss that provides two optimal pruning configurations: a large…
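The supervised contrastive fine-tuning that follows the pruning step is often implemented with an InfoNCE-style loss over in-batch negatives. The abstract does not state the paper's exact objective, so the following is a minimal pure-Python sketch under that assumption. The function names and the temperature of 0.05 are illustrative choices, not values from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(queries, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (assumed objective, not confirmed as the paper's).

    queries[i] and positives[i] form a positive pair; every positives[j] with
    j != i acts as a negative for queries[i].
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_sum_exp - logits[i])  # -log softmax probability of the positive
    return sum(losses) / len(losses)
```

With `queries = [[1, 0], [0, 1]]`, matching positives give a loss near zero, while swapping the two positives raises it to about `1 / temperature` (20 here).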
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Pruning
