Agent-X: Full Pipeline Acceleration of On-device AI Agents
Jinha Chung, Byeongjun Shin, Jiin Kim, Minsoo Rhu

TL;DR
Agent-X is a software-only framework that speeds up on-device AI agents by accelerating both prompt processing (prefill) and decoding, achieving a 1.61x end-to-end speedup with no accuracy loss.
Contribution
Agent-X introduces two techniques for reducing latency in on-device AI agents: prompt rewriting that exploits prefix caching for agent-specific input-token patterns, and LLM-free speculative decoding for low-overhead token generation.
Findings
Achieves a 1.61x end-to-end speedup on representative agentic workloads, measured on real systems.
No accuracy loss observed with the proposed acceleration methods.
To the authors' knowledge, the first work to systematically characterize and eliminate latency bottlenecks in on-device agents.
Abstract
LLM-based agents deliver state-of-the-art performance across tasks but incur high end-to-end latency on edge devices. We introduce Agent-X, a software-only, accuracy-preserving framework that accelerates both the prefill and decode stages of on-device agent workloads. Agent-X's two key components rewrite prompts to leverage prefix caching tailored to agent-specific input-token patterns and enable LLM-free speculative decoding for fast token generation with minimal overhead. On representative agentic workloads, Agent-X achieves a 1.61x end-to-end speedup in real systems with no accuracy loss and can be seamlessly integrated into existing on-device AI agents. To the best of our knowledge, ours is the first to systematically characterize and eliminate latency bottlenecks in on-device agents.
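The two ideas named in the abstract can be illustrated with a small toy sketch. This is not Agent-X's implementation: the abstract gives no algorithmic detail, so the code below substitutes generic stand-ins. Prompt rewriting is modeled as moving the static part of a prompt to the front so that consecutive agent steps share a longer cacheable prefix, and LLM-free drafting is modeled with n-gram lookup over the existing context. All function names, token lists, and parameters (`shared_prefix_len`, `propose_draft`, `accept_draft`, `ngram`, `max_draft`) are hypothetical.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the leading tokens two sequences share (what a prefix cache could reuse)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def propose_draft(context: Sequence[str], ngram: int = 2, max_draft: int = 4) -> List[str]:
    """Model-free draft (assumed n-gram lookup, not the paper's method):
    find the most recent earlier occurrence of the last `ngram` tokens
    and copy up to `max_draft` tokens that followed it."""
    if len(context) <= ngram:
        return []
    tail = list(context[-ngram:])
    # Scan backwards over earlier positions, skipping the tail itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if list(context[start:start + ngram]) == tail:
            return list(context[start + ngram:start + ngram + max_draft])
    return []


def accept_draft(draft: Sequence[str], target_tokens: Sequence[str]) -> List[str]:
    """Keep the longest draft prefix that agrees with the target model's tokens."""
    return list(draft[:shared_prefix_len(draft, target_tokens)])


# Hypothetical agent prompts: tool definitions are static, the query changes per step.
tools = ["<tools>", "get_weather", "set_alarm", "</tools>"]
original = (["query:", "weather?"] + tools, ["query:", "alarm?"] + tools)
rewritten = (tools + ["query:", "weather?"], tools + ["query:", "alarm?"])

print("reusable prefix, original order:", shared_prefix_len(*original))
print("reusable prefix, rewritten order:", shared_prefix_len(*rewritten))

# Hypothetical decode context in which a tool call pattern repeats.
context = ["call", "get_weather", "(", "city", ")", "then", "call", "get_weather"]
draft = propose_draft(context, ngram=2, max_draft=3)
print("draft:", draft)
print("accepted:", accept_draft(draft, ["(", "city", ",", "unit"]))
```

In this toy setting, placing the static tool definitions first lengthens the prefix shared across steps, and the draft proposer needs no model at all. A real system would verify drafted tokens with the target LLM in a single forward pass, which is what keeps speculative decoding accuracy-preserving.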
