HeRo: Adaptive Orchestration of Agentic RAG on Heterogeneous Mobile SoC
Maoliang Li, Jiayu Chen, Zihao Zheng, Ziqian Li, Xinhao Sun, Guojie Luo, Chenchen Liu, Xiang Chen

TL;DR
HeRo is a framework for deploying agentic retrieval-augmented generation (RAG) on heterogeneous mobile SoCs. It reduces latency by up to 10.94x over existing methods by scheduling heterogeneous models and dynamic workflows across the SoC's processing units.
Contribution
HeRo introduces profiling-based performance models and a lightweight online scheduler for efficient, low-latency agentic RAG on heterogeneous mobile SoCs.
Findings
Up to 10.94x latency reduction compared to existing methods.
Effective handling of heterogeneous models and dynamic workflows.
Practical on-device agentic RAG enabled.
Abstract
With the increasing computational capability of mobile devices, deploying agentic retrieval-augmented generation (RAG) locally on heterogeneous System-on-Chips (SoCs) has become a promising way to enhance LLM-based applications. However, agentic RAG induces multi-stage workflows with heterogeneous models and dynamic execution flow, while mobile SoCs exhibit strong accelerator affinity, shape sensitivity, and shared-memory bandwidth contention, making naive scheduling ineffective. We present HeRo, a heterogeneous-aware framework for low-latency agentic RAG on mobile SoCs. HeRo builds profiling-based performance models for each sub-stage and model-PU configuration, capturing latency, workload shape, and contention-induced slowdown, and leverages them in a lightweight online scheduler that combines shape-aware sub-stage partitioning, criticality-based accelerator mapping, and…
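To make two ideas named in the abstract more concrete (a profiled latency table per stage and processing unit, and criticality-ordered mapping onto accelerators), here is a minimal Python sketch. It is not HeRo's algorithm: the stage names, latency values, the `CONTENTION` factor, and the `schedule` function are all hypothetical, the sub-stages are assumed to be ready at the same time with no dependencies between them, and contention is approximated as a fixed slowdown applied only to the newly placed stage.

```python
# Hypothetical profiled latencies in ms per (stage, processing unit).
# All values are illustrative and not taken from the paper.
PROFILE = {
    ("embed", "NPU"): 12.0, ("embed", "GPU"): 20.0, ("embed", "CPU"): 48.0,
    ("rerank", "NPU"): 30.0, ("rerank", "GPU"): 25.0, ("rerank", "CPU"): 90.0,
    ("summarize", "NPU"): 80.0, ("summarize", "GPU"): 110.0, ("summarize", "CPU"): 400.0,
}

# Assumed fixed slowdown when another unit is still busy at the start time,
# a crude stand-in for shared-memory bandwidth contention.
CONTENTION = 1.25


def schedule(stages, pus=("NPU", "GPU", "CPU")):
    """Greedily map independent, ready stages to units; returns {stage: (pu, finish_ms)}."""
    free_at = {pu: 0.0 for pu in pus}  # time at which each unit becomes idle
    plan = {}
    # Criticality proxy (an assumption): the stage with the largest
    # best-case latency is placed first.
    order = sorted(stages, key=lambda s: min(PROFILE[(s, p)] for p in pus), reverse=True)
    for stage in order:
        best = None
        for pu in pus:
            start = free_at[pu]
            others_busy = any(free_at[o] > start for o in pus if o != pu)
            latency = PROFILE[(stage, pu)] * (CONTENTION if others_busy else 1.0)
            finish = start + latency
            if best is None or finish < best[1]:
                best = (pu, finish)
        plan[stage] = best
        free_at[best[0]] = best[1]
    return plan


if __name__ == "__main__":
    for stage, (pu, finish) in schedule(["embed", "rerank", "summarize"]).items():
        print(f"{stage}: {pu}, finishes at {finish} ms")
```

With these made-up numbers, the longest stage takes the NPU and the two shorter stages share the GPU, even though each of them would run faster on an idle NPU. A fuller model would also slow down already-running stages under contention and respect dependencies between workflow stages.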
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Interconnection Networks and Systems · Embedded Systems Design Techniques
