Robust Implementation of Retrieval-Augmented Generation on Edge-based Computing-in-Memory Architectures

Ruiyang Qin; Zheyu Yan; Dewen Zeng; Zhenge Jia; Dancheng Liu; Jianbo Liu; Zhi Zheng; Ningyuan Cao; Kai Ni; Jinjun Xiong; Yiyu Shi

arXiv:2405.04700·cs.LG·May 9, 2024·1 cites

Open Access

TL;DR

This paper introduces RoCR, a novel framework that leverages Computing-in-Memory architectures to accelerate Retrieval-Augmented Generation on edge devices, reducing latency and improving scalability without updating large language models.

Contribution

The paper presents the first use of Computing-in-Memory (CiM) architectures to improve RAG efficiency, incorporating contrastive and noise-aware training methods for better performance on edge devices.

Findings

01 RoCR significantly reduces retrieval latency on edge devices.

02 The framework improves scalability by handling growing user data efficiently.

03 Experimental results demonstrate enhanced retrieval speed and robustness.

Abstract

Large Language Models (LLMs) deployed on edge devices learn through fine-tuning and updating a certain portion of their parameters. Although such learning methods can be optimized to reduce resource utilization, the overall required resources remain a heavy burden on edge devices. Instead, Retrieval-Augmented Generation (RAG), a resource-efficient LLM learning method, can improve the quality of the LLM-generated content without updating model parameters. However, the RAG-based LLM may involve repetitive searches on the profile data in every user-LLM interaction. This search can lead to significant latency along with the accumulation of user data. Conventional efforts to decrease latency result in restricting the size of saved user data, thus reducing the scalability of RAG as user data continuously grows. It remains an open question: how to free RAG from the constraints of latency and…
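
The per-interaction search the abstract describes can be illustrated with a minimal sketch. It assumes the profile data is stored as embedding vectors and that the best match is chosen by cosine similarity; both are assumptions made for illustration, not details given in the abstract. The function name `retrieve_top_k` and the toy vectors are hypothetical. The scan visits every stored entry, which is why the search cost grows as user data accumulates.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def retrieve_top_k(
    query: Sequence[float],
    profile_embeddings: Sequence[Sequence[float]],
    k: int = 1,
) -> List[Tuple[int, float]]:
    """Hypothetical helper: brute-force cosine-similarity search.

    Every stored embedding is scored against the query, so the work
    per user-LLM interaction is linear in the amount of profile data.
    """
    q = normalize(query)
    scored = []
    for idx, emb in enumerate(profile_embeddings):
        e = normalize(emb)
        score = sum(a * b for a, b in zip(q, e))
        scored.append((idx, score))
    # Highest similarity first; keep at most k entries.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy profile data (assumed values, for illustration only).
profile = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
top = retrieve_top_k([0.9, 0.1, 0.0], profile, k=2)
print([idx for idx, _ in top])
```

The retrieved entries would then be added to the prompt, so the model itself is never updated. The scoring loop is a series of multiply-accumulate operations over stored vectors, the kind of workload the paper targets with CiM hardware.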

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Graph Theory and Algorithms · Optimization and Search Problems

Methods: Attention Is All You Need · Linear Layer · Weight Decay · Multi-Head Attention · Attention Dropout · Dropout · Residual Connection · Softmax · WordPiece