PerCache: Predictive Hierarchical Cache for RAG Applications on Mobile Devices
Kaiwei Liu, Liekang Zeng, Lilin Xu, Bufang Yang, Zhenyu Yan

TL;DR
PerCache is a hierarchical caching system for mobile RAG applications. It predicts likely queries, reuses intermediate results to reduce latency, and adapts dynamically to changes in system load.
Contribution
It introduces a novel hierarchical cache architecture with predictive query population and adaptive configuration for mobile RAG systems.
Findings
Achieves a 34.4% latency reduction over baselines
Improves the cache hit rate through prediction
Maintains latency performance under dynamic system loads
Abstract
Retrieval-augmented generation (RAG) has become a de facto paradigm for large language model (LLM)-driven applications on mobile devices, such as mobile assistants that draw on personal emails or meeting records. However, because of lengthy prompts and resource constraints, mobile RAG systems suffer from high response latency. One promising approach to this problem is to reuse intermediate computational results across different queries, eliminating redundant computation. Yet most existing approaches, such as KV cache reuse and semantic cache reuse, are designed for cloud settings and perform poorly on mobile devices because they overlook the distinctive characteristics of mobile RAG. We propose PerCache, a novel hierarchical cache solution designed to reduce the end-to-end latency of personalized RAG applications on mobile platforms. PerCache adopts a hierarchical architecture…
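The abstract is cut off before it describes the tiers, so the following is only an illustrative sketch of how a hierarchical cache for RAG could combine the two reuse ideas the abstract names (semantic reuse and reuse of intermediate results). It is not PerCache's actual design. The class `TwoTierCache`, the `threshold` value, the toy embeddings, and the string placeholder standing in for a stored intermediate state are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class TwoTierCache:
    """Hypothetical two-tier cache; names and layout are illustrative only."""
    threshold: float = 0.9                       # assumed similarity cutoff
    answers: list = field(default_factory=list)  # tier 1: (query embedding, answer)
    states: dict = field(default_factory=dict)   # tier 2: chunk ids -> intermediate state

    def lookup(self, query_vec, chunk_ids):
        # Tier 1: reuse a whole answer when a cached query is similar enough.
        for vec, answer in self.answers:
            if cosine(vec, query_vec) >= self.threshold:
                return "answer", answer
        # Tier 2: reuse intermediate results computed for the same retrieved chunks.
        key = tuple(chunk_ids)
        if key in self.states:
            return "state", self.states[key]
        # Nothing reusable: the caller runs the full RAG pipeline.
        return "miss", None

    def store(self, query_vec, chunk_ids, state, answer):
        self.answers.append((query_vec, answer))
        self.states[tuple(chunk_ids)] = state


cache = TwoTierCache()
cache.store([1.0, 0.0], ["email_12", "notes_3"], "kv-placeholder", "Meeting is at 3pm.")
hit_similar = cache.lookup([0.99, 0.05], ["email_7"])           # near-duplicate query
hit_chunks = cache.lookup([0.0, 1.0], ["email_12", "notes_3"])  # new query, same chunks
miss = cache.lookup([0.0, 1.0], ["email_7"])                    # nothing to reuse
```

In this sketch a near-duplicate query is served from the first tier without any model call, a new query over already-seen chunks reuses the stored intermediate state, and everything else falls through to full computation. A predictive variant would call `store` ahead of time for queries it expects the user to ask.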
Taxonomy
Topics: Big Data and Digital Economy · Cloud Computing and Resource Management · Caching and Content Delivery
