A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation
Julian Killingback, Ofer Meshi, Henry Li, Hamed Zamani, Maryam Karimzadehgan

TL;DR
This paper introduces a unified model for on-device retrieval-augmented generation that compresses context and uses shared representations to reduce memory and storage needs while maintaining performance.
Contribution
It proposes the first unified model that combines retrieval and context compression with shared representations, optimizing on-device RAG systems.
Findings
The model uses one-tenth of the context size of traditional RAG.
Matches traditional RAG performance without increasing storage.
Reduces disk and memory usage through shared representations.
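To give a rough sense of why a tenfold reduction in context size matters on-device, the sketch below estimates KV cache memory for a standard transformer decoder, where the cache grows linearly with the number of context tokens. The model dimensions and token counts are hypothetical values chosen for illustration, not the paper's configuration, and `kv_cache_bytes` is an illustrative helper rather than anything from the paper.

```python
def kv_cache_bytes(context_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate bytes needed to cache keys and values for one sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical small on-device model (assumed, not taken from the paper).
config = dict(num_layers=24, num_kv_heads=8, head_dim=64, bytes_per_value=2)

# Assumed context lengths: a full retrieved context vs. one-tenth of it.
full = kv_cache_bytes(context_tokens=4000, **config)
compressed = kv_cache_bytes(context_tokens=400, **config)

print(f"full context: {full / 2**20:.1f} MiB")
print(f"1/10 context: {compressed / 2**20:.1f} MiB")
```

Under these assumptions the cache shrinks by the same factor as the context, since every other term in the product is fixed by the model architecture.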
Abstract
Traditional Retrieval-Augmented Generation (RAG) approaches generally assume that retrieval and generation occur on powerful servers removed from the end user. While this reduces local hardware constraints, it introduces significant drawbacks: privacy concerns regarding data access, recurring maintenance and storage costs, increased latency, and the necessity of an internet connection. On-device RAG addresses these challenges by executing the entire pipeline locally, making it ideal for querying sensitive personal information such as financial documents, contact details, and medical history. However, on-device deployment necessitates a delicate balance between limited memory and disk space. Specifically, the context size provided to the generative model must be restricted to manage KV cache and attention memory usage, while the size of stored embeddings must be minimized to preserve…
