A Memory-Efficient Retrieval Architecture for RAG-Enabled Wearable Medical LLMs-Agents
Zhipeng Liao, Kunming Shao, Jiangnan Yu, Liang Zhao, Tim Kwang-Ting Cheng, Chi-Ying Tsui, Jie Yang, Mohamad Sawan

TL;DR
This paper introduces a hierarchical retrieval architecture for edge RAG in medical LLMs that reduces memory access by nearly 50% and energy consumption by 75% while maintaining retrieval accuracy, making it suitable for privacy-preserving medical AI agents.
Contribution
It proposes a novel two-stage retrieval scheme for edge RAG that significantly improves energy efficiency and reduces memory access in medical LLM applications.
Findings
Memory access reduced by nearly 50%.
Energy consumption decreased by 75%.
Maintains high retrieval accuracy.
Abstract
With powerful and integrative large language models (LLMs), medical AI agents have demonstrated unique advantages in providing personalized medical consultations, continuous health monitoring, and precise treatment plans. Retrieval-Augmented Generation (RAG) integrates personal medical documents into LLMs through an external retrievable database, avoiding the costly retraining or fine-tuning otherwise needed to deploy customized agents. While deploying medical agents on edge devices ensures privacy protection, RAG implementations incur substantial memory access and energy consumption during the retrieval stage. This paper presents a hierarchical retrieval architecture for edge RAG built on a two-stage retrieval scheme: approximate retrieval generates a candidate set, followed by high-precision retrieval over the pre-selected document embeddings. The proposed architecture…
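The two-stage scheme described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the sign-bit (1-bit) coarse code, the dot-product similarity, and the names `to_sign_bits`, `two_stage_retrieve`, `num_candidates`, and `top_k` are all assumptions chosen for illustration. The sketch only shows the general pattern of scoring every document cheaply, then rescoring a small candidate set at full precision.

```python
def dot(a, b):
    """Plain dot product, used as the similarity score in both stages."""
    return sum(x * y for x, y in zip(a, b))


def to_sign_bits(vec):
    """Hypothetical coarse code: +1 for non-negative components, -1 otherwise."""
    return [1 if x >= 0 else -1 for x in vec]


def two_stage_retrieve(query, docs, coarse_docs, num_candidates, top_k):
    """Return indices of the top_k documents using coarse-then-fine scoring.

    docs        -- full-precision document embeddings
    coarse_docs -- precomputed low-precision codes, one per document
    """
    # Stage 1 (approximate): score every document using only the coarse codes,
    # so the full-precision embeddings are not read at this point.
    coarse_query = to_sign_bits(query)
    coarse_scores = [dot(coarse_query, code) for code in coarse_docs]
    candidates = sorted(
        range(len(docs)), key=lambda i: coarse_scores[i], reverse=True
    )[:num_candidates]

    # Stage 2 (high precision): read full-precision embeddings only for the
    # pre-selected candidates and rerank them exactly.
    reranked = sorted(candidates, key=lambda i: dot(query, docs[i]), reverse=True)
    return reranked[:top_k]


# Toy usage with four 2-dimensional embeddings (made-up values).
docs = [[1.0, 0.0], [0.6, 0.8], [-1.0, 0.0], [0.0, -1.0]]
coarse_docs = [to_sign_bits(d) for d in docs]
print(two_stage_retrieve([0.5, 1.0], docs, coarse_docs, num_candidates=2, top_k=2))
```

In this sketch the second stage touches only `num_candidates` full-precision embeddings instead of all of them, which is where a reduction in memory access would come from. The trade-off is that a relevant document dropped in the first stage cannot be recovered in the second, so the candidate set size governs how closely the result tracks exact retrieval.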
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Healthcare
