The Impact of Quantization on Retrieval-Augmented Generation: An Analysis of Small LLMs
Mert Yazan, Suzan Verberne, Frederik Situmeang

TL;DR
This paper investigates how post-training quantization affects the performance of small LLMs on retrieval-augmented generation tasks, especially with long contexts, and shows that quantized models can still perform well under certain conditions.
Contribution
It provides an analysis of quantization effects on small LLMs' retrieval-augmented generation capabilities, highlighting conditions where quantization does not impair performance.
Findings
Quantization does not impair the performance of 7B LLMs that already perform well on RAG tasks.
Long-context reasoning remains effective in quantized models under certain conditions.
Different retrieval models influence the performance of quantized LLMs in long-context tasks.
Abstract
Post-training quantization reduces the computational demand of Large Language Models (LLMs) but can weaken some of their capabilities. Since LLM abilities emerge with scale, smaller LLMs are more sensitive to quantization. In this paper, we explore how quantization affects smaller LLMs' ability to perform retrieval-augmented generation (RAG), specifically in longer contexts. We chose personalization for evaluation because it is a challenging domain to perform using RAG as it requires long-context reasoning over multiple documents. We compare the original FP16 and the quantized INT4 performance of multiple 7B and 8B LLMs on two tasks while progressively increasing the number of retrieved documents to test how quantized models fare against longer contexts. To better understand the effect of retrieval, we evaluate three retrieval models in our experiments. Our findings reveal that if a 7B…
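To make the FP16-versus-INT4 comparison in the abstract more concrete, the following minimal sketch shows what rounding weights to 4 bits does to their values. It assumes plain symmetric round-to-nearest quantization with a single scale per tensor; the abstract excerpt does not name the quantization method the authors used, so this is an illustration of the general idea rather than their procedure. The function names and sample weights are hypothetical.

```python
def quantize_int4(weights):
    """Map floats to signed 4-bit integer codes in [-8, 7] (assumed scheme)."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; the largest magnitude maps to code 7.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


# Hypothetical weight values, chosen only for illustration.
weights = [0.70, -0.31, 0.12, 0.01, -0.68]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)

# The rounding error per weight is bounded by half the scale.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(scale, 4), round(max_error, 4))
```

Each weight is stored with only 16 possible levels, so small weights such as 0.01 collapse to zero. Whether that loss of precision matters for long-context RAG is the empirical question the paper examines.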
Taxonomy
Topics: Library Science and Information Systems
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · WordPiece · Residual Connection · Byte Pair Encoding · Layer Normalization · Attention Dropout
