Robust Implementation of Retrieval-Augmented Generation on Edge-based Computing-in-Memory Architectures
Ruiyang Qin, Zheyu Yan, Dewen Zeng, Zhenge Jia, Dancheng Liu, Jianbo Liu, Zhi Zheng, Ningyuan Cao, Kai Ni, Jinjun Xiong, Yiyu Shi

TL;DR
This paper introduces RoCR, a novel framework that leverages Computing-in-Memory architectures to accelerate Retrieval-Augmented Generation on edge devices, reducing latency and improving scalability without updating large language models.
Contribution
It presents the first use of CiM architectures to enhance RAG efficiency, incorporating contrastive and noise-aware training methods for better performance on edge devices.
Findings
RoCR significantly reduces retrieval latency on edge devices.
The framework improves scalability by handling growing user data efficiently.
Experimental results demonstrate enhanced retrieval speed and robustness.
Abstract
Large Language Models (LLMs) deployed on edge devices learn through fine-tuning and updating a certain portion of their parameters. Although such learning methods can be optimized to reduce resource utilization, the overall required resources remain a heavy burden on edge devices. Instead, Retrieval-Augmented Generation (RAG), a resource-efficient LLM learning method, can improve the quality of the LLM-generated content without updating model parameters. However, the RAG-based LLM may involve repetitive searches on the profile data in every user-LLM interaction. This search can lead to significant latency along with the accumulation of user data. Conventional efforts to decrease latency result in restricting the size of saved user data, thus reducing the scalability of RAG as user data continuously grows. It remains an open question: how to free RAG from the constraints of latency and…
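To make the latency concern in the abstract concrete, here is a minimal sketch of the retrieval step in a RAG pipeline, written as a brute-force inner-product search over stored profile embeddings. It is an illustration under stated assumptions, not the RoCR implementation: the function name `top_k_by_inner_product`, the toy three-dimensional embeddings, and the Gaussian `noise_std` term (a rough stand-in for device variation in a CiM array) are all hypothetical and do not come from the paper.

```python
import random


def top_k_by_inner_product(query, stored, k=1, noise_std=0.0, seed=None):
    """Return the indices of the k stored vectors that score highest against query.

    Every stored vector is scored on every call, so the cost grows linearly
    with the amount of saved user data. That per-interaction scan is the
    latency problem the abstract describes.

    noise_std > 0 perturbs each stored value with Gaussian noise. This is an
    assumed, simplified model of non-ideal in-memory storage, not the paper's.
    """
    rng = random.Random(seed)
    scored = []
    for index, vector in enumerate(stored):
        if noise_std > 0.0:
            vector = [v + rng.gauss(0.0, noise_std) for v in vector]
        score = sum(q * v for q, v in zip(query, vector))
        scored.append((score, index))
    # Highest score first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:k]]


# Toy profile data: three stored embeddings (hypothetical values).
profile = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query = [0.9, 0.1, 0.0]

print(top_k_by_inner_product(query, profile, k=2))  # [0, 2]
```

On a conventional processor this loop touches every stored embedding for each query. The appeal of a CiM array, as the summary above describes it, is that the search runs where the data is stored, which is why robustness to noise in the stored values matters for retrieval quality.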
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Graph Theory and Algorithms · Optimization and Search Problems
Methods: Attention Is All You Need · Linear Layer · Weight Decay · Multi-Head Attention · Attention Dropout · Dropout · Residual Connection · Softmax · WordPiece
