VELO: A Vector Database-Assisted Cloud-Edge Collaborative LLM QoS Optimization Framework
Zhi Yao, Zhiqing Tang, Jiong Lou, Ping Shen, Weijia Jia

TL;DR
The paper introduces VELO, a framework that uses vector database caching and multi-agent reinforcement learning to optimize LLM QoS at the edge, reducing delays and costs without modifying LLMs.
Contribution
VELO is a novel framework that leverages vector databases and MARL for cloud-edge LLM QoS optimization, applicable to diverse LLMs without internal modifications.
Findings
Significantly reduces response delay and resource consumption.
Enhances user satisfaction with improved QoS.
Effective in real edge system deployments.
Abstract
Large Language Models (LLMs) have gained significant popularity and are used extensively across various domains. Most LLM deployments occur within cloud data centers, where they encounter substantial response delays and incur high costs, thereby impacting the Quality of Service (QoS) at the network edge. Leveraging vector database caching to store LLM request results at the edge can substantially mitigate the response delays and costs associated with similar requests, an opportunity that previous research has overlooked. To address this gap, this paper introduces a novel Vector database-assisted cloud-Edge collaborative LLM QoS Optimization (VELO) framework. First, we propose the VELO framework, which employs a vector database to cache the results of some LLM requests at the edge to reduce the response time of subsequent similar requests. Diverging from direct optimization of…
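The edge-caching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `embed`, `EdgeCache`, and `answer` are hypothetical, the bag-of-words embedding stands in for a real sentence encoder and vector database, and a fixed similarity threshold replaces the multi-agent reinforcement learning policy that VELO uses to decide between the edge cache and the cloud LLM.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class EdgeCache:
    """Hypothetical edge-side cache of (request embedding, LLM response) pairs."""

    def __init__(self, threshold=0.8):
        # Fixed threshold: an assumption replacing VELO's learned decision policy.
        self.threshold = threshold
        self.entries = []

    def lookup(self, query):
        """Return the response of the most similar cached request, or None."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query, response):
        self.entries.append((embed(query), response))


def answer(query, cache, cloud_llm):
    """Serve from the edge cache when possible; otherwise call the cloud LLM."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, "edge"
    response = cloud_llm(query)  # slow, costly path
    cache.store(query, response)  # later similar requests can hit the edge
    return response, "cloud"
```

With this sketch, the first request for a given prompt goes to the cloud and a later identical or near-identical request is answered from the edge, which is where the delay and cost savings described in the abstract would come from. The cloud LLM is treated as an opaque callable, consistent with the claim that the LLM itself needs no modification.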
Taxonomy
Topics: Cloud Computing and Resource Management
