MatKV: Trading Compute for Flash Storage in LLM Inference
Kun-Woo Shin (1), Jay H. Park (2), Moonwook Oh (2), Yohan Jo (1), Jaeyoung Do (1), Sang-Won Lee (1) ((1) Seoul National University, Korea (2) Samsung Electronics, Korea)

TL;DR
MatKV makes retrieval-augmented generation (RAG) inference more efficient by precomputing the key-value vectors of RAG objects and storing them in flash storage, which reduces inference time and power consumption while maintaining accuracy.
Contribution
This paper introduces MatKV, a scheme that precomputes the key-value vectors of RAG objects and materializes them in flash storage, so that inference reuses the stored vectors instead of recomputing them on the GPU during the prefill phase.
Findings
Halves inference time and power consumption for RAG workloads.
Enables use of low-end GPUs for decoding without speed loss.
Maintains accuracy in question-answering tasks.
Abstract
We observe two major trends in LLM-based generative AI: (1) inference is becoming the dominant factor in terms of cost and power consumption, surpassing training, and (2) retrieval-augmented generation (RAG) is becoming prevalent. When processing long inputs in RAG, the prefill phase of computing the key-value vectors of input text is energy-intensive and time-consuming even with high-end GPUs. Thus, it is crucial to make the prefill phase in RAG inference efficient. To address this issue, we propose MatKV, a scheme that precomputes the key-value vectors (KVs) of RAG objects (e.g., documents), materializes them in inexpensive but fast and power-efficient flash storage, and reuses them at inference time instead of recomputing the KVs using costly and power-inefficient GPUs. Experimental results using Hugging Face's Transformers library across state-of-the-art GPUs and flash memory SSDs…
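The precompute, materialize, and reuse flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the functions `embed`, `prefill`, `materialize`, `load_kv`, and `attend`, the flat binary file layout, and the hidden size `DIM` are all hypothetical names and choices made for illustration. The toy also computes each token's key and value independently, so it ignores positional encoding and attention between tokens, which a real model's prefill does not.

```python
import array
import math
import os
import tempfile

DIM = 4  # toy hidden size (assumption, for illustration only)


def embed(token):
    """Toy deterministic embedding standing in for a model's hidden state."""
    seed = sum(map(ord, token))
    return [((seed * (i + 1)) % 17) / 17.0 for i in range(DIM)]


def prefill(tokens):
    """Stand-in for the expensive prefill: one key and one value per token."""
    keys = [embed(t) for t in tokens]
    values = [[2.0 * x for x in k] for k in keys]
    return keys, values


def materialize(path, keys, values):
    """Write all key vectors, then all value vectors, as raw doubles."""
    flat = array.array("d", (x for vec in keys + values for x in vec))
    with open(path, "wb") as f:
        flat.tofile(f)


def load_kv(path):
    """Read the vectors back from storage instead of recomputing them."""
    flat = array.array("d")
    n_items = os.path.getsize(path) // flat.itemsize
    with open(path, "rb") as f:
        flat.fromfile(f, n_items)
    vecs = [list(flat[i:i + DIM]) for i in range(0, len(flat), DIM)]
    half = len(vecs) // 2
    return vecs[:half], vecs[half:]


def attend(query, keys, values):
    """Single-head scaled dot-product attention over the cached KVs."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM)
              for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(DIM)]


if __name__ == "__main__":
    doc = "flash storage is inexpensive and power efficient".split()
    path = os.path.join(tempfile.mkdtemp(), "doc.kv")

    # Ingest time: pay the prefill cost once and persist the result.
    materialize(path, *prefill(doc))

    # Query time: load the stored KVs and go straight to attention.
    keys, values = load_kv(path)
    print(attend(embed("storage"), keys, values))
```

In this sketch the only work left at query time is a file read plus attention over the loaded vectors; the trade the abstract describes is replacing the `prefill` call with `load_kv`.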
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Ferroelectric and Negative Capacitance Devices
