A Memory-Efficient Retrieval Architecture for RAG-Enabled Wearable Medical LLMs-Agents

Zhipeng Liao; Kunming Shao; Jiangnan Yu; Liang Zhao; Tim Kwang-Ting Cheng; Chi-Ying Tsui; Jie Yang; Mohamad Sawan

arXiv:2510.27107·cs.AR·November 3, 2025

Open Access

TL;DR

This paper introduces a hierarchical retrieval architecture for edge RAG in medical LLMs that cuts memory access by nearly 50% and energy consumption by 75% while maintaining retrieval accuracy, making it suitable for privacy-preserving medical AI agents on edge devices.

Contribution

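The paper proposes a two-stage retrieval scheme for edge RAG: approximate retrieval generates a candidate set, and high-precision retrieval then runs only on the pre-selected document embeddings. This lowers memory access and energy consumption during retrieval in medical LLM applications.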

Findings

01 Memory access reduced by nearly 50%.

02 Energy consumption decreased by 75%.

03 Maintains high retrieval accuracy.

Abstract

With powerful and integrative large language models (LLMs), medical AI agents have demonstrated unique advantages in providing personalized medical consultations, continuous health monitoring, and precise treatment plans. Retrieval-Augmented Generation (RAG) integrates personal medical documents into LLMs by an external retrievable database to address the costly retraining or fine-tuning issues in deploying customized agents. While deploying medical agents in edge devices ensures privacy protection, RAG implementations impose substantial memory access and energy consumption during the retrieval stage. This paper presents a hierarchical retrieval architecture for edge RAG, leveraging a two-stage retrieval scheme that combines approximate retrieval for candidate set generation, followed by high-precision retrieval on pre-selected document embeddings. The proposed architecture…
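The two-stage scheme described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's design: the 4-bit coarse and 8-bit fine codes, the uniform quantizer, the integer dot-product score, and the hypothetical names `quantize`, `build_index`, and `two_stage_retrieve`. Stage 1 scores every document using only its coarse code to pick a candidate set; stage 2 rescores just those candidates at higher precision.

```python
import heapq


def quantize(vec, bits):
    """Map floats in [-1, 1] to signed integers that fit in `bits` bits (assumed quantizer)."""
    qmax = (1 << (bits - 1)) - 1  # 7 for 4 bits, 127 for 8 bits
    return [max(-qmax, min(qmax, round(x * qmax))) for x in vec]


def dot(a, b):
    """Integer dot product used as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def build_index(docs, coarse_bits=4, fine_bits=8):
    """Precompute a coarse and a fine code for every document embedding."""
    return {
        "coarse_bits": coarse_bits,
        "fine_bits": fine_bits,
        "coarse": [quantize(d, coarse_bits) for d in docs],
        "fine": [quantize(d, fine_bits) for d in docs],
    }


def two_stage_retrieve(query, index, k=3, n_candidates=16):
    """Return the ids of the top-k documents, best first."""
    # Stage 1: approximate scores over all documents, reading only coarse codes.
    q_coarse = quantize(query, index["coarse_bits"])
    coarse_scores = [(dot(q_coarse, code), i) for i, code in enumerate(index["coarse"])]
    candidates = [i for _, i in heapq.nlargest(n_candidates, coarse_scores)]

    # Stage 2: higher-precision scores, computed only for the pre-selected candidates.
    q_fine = quantize(query, index["fine_bits"])
    fine_scores = [(dot(q_fine, index["fine"][i]), i) for i in candidates]
    return [i for _, i in heapq.nlargest(k, fine_scores)]


# Toy embeddings (made up for this example).
docs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
]
index = build_index(docs)
print(two_stage_retrieve([1.0, 0.0, 0.0, 0.0], index, k=2, n_candidates=2))  # prints [0, 3]
```

With the assumed widths, stage 1 reads half as many bits per embedding as a scan over 8-bit codes would, and stage 2 touches only `n_candidates` embeddings. These figures describe the sketch only and are not the paper's measurements.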

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Healthcare