Towards Efficient Agents: A Co-Design of Inference Architecture and System

Weizhe Lin; Hui-Ling Zhen; Shuai Yang; Xian Wang; Renxi Liu; Hanting Chen; Wangze Zhang; Chuansai Zhou; Yiming Li; Chen Chen; Xing Li; Zhiyuan Yang; Xiaosong Li; Xianzhi Yu; Zhenhua Dong; Mingxuan Yuan; Yunhe Wang

arXiv:2512.18337·cs.CL·February 25, 2026

Open Access

TL;DR

This paper introduces AgentInfer, a framework that co-designs inference architecture and serving system to make LLM-based agents more efficient, reporting a 1.8-2.5x speedup on agent reasoning tasks and more than 50% less ineffective token consumption while maintaining accuracy.

Contribution

The paper proposes a novel, integrated system combining hierarchical reasoning, cache-aware scheduling, speculative decoding, and semantic compression for end-to-end agent acceleration.

Findings

01. Over 50% reduction in ineffective token consumption
02. A 1.8-2.5x speedup in agent reasoning tasks
03. Accuracy is maintained while efficiency improves

Abstract

The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making. However, their real-world deployment is hindered by severe inefficiencies that arise not from isolated model inference, but from the systemic latency accumulated across reasoning loops, context growth, and heterogeneous tool interactions. This paper presents AgentInfer, a unified framework for end-to-end agent acceleration that bridges inference optimization and architectural design. We decompose the problem into four synergistic components: AgentCollab, a hierarchical dual-model reasoning framework that balances large- and small-model usage through dynamic role assignment; AgentSched, a cache-aware hybrid scheduler that minimizes latency under heterogeneous request patterns; AgentSAM, a suffix-automaton-based speculative…
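The abstract is cut off before it describes AgentSAM in detail, so the following is only a minimal sketch of the general idea behind suffix-automaton-based drafting, not the paper's method. It assumes a reference token sequence (for example, earlier context) is indexed in a suffix automaton, and that draft tokens are proposed by finding the longest suffix of the recent output that occurs in the reference and copying what followed its first occurrence. The names `SuffixAutomaton`, `draft`, and `max_draft` are hypothetical, and the verification step by the target model is omitted.

```python
class SuffixAutomaton:
    """Suffix automaton over a token sequence (illustrative sketch only)."""

    def __init__(self):
        self.next = [{}]        # per-state transitions: token -> state
        self.link = [-1]        # suffix links
        self.length = [0]       # longest substring length represented by each state
        self.first_end = [-1]   # end index of the first occurrence for each state
        self.last = 0
        self.tokens = []

    def _new_state(self, length, first_end):
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        self.first_end.append(first_end)
        return len(self.next) - 1

    def extend(self, tok):
        """Append one token to the indexed sequence (standard online construction)."""
        pos = len(self.tokens)
        self.tokens.append(tok)
        cur = self._new_state(self.length[self.last] + 1, pos)
        p = self.last
        while p != -1 and tok not in self.next[p]:
            self.next[p][tok] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][tok]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone keeps q's transitions and first occurrence.
                clone = self._new_state(self.length[p] + 1, self.first_end[q])
                self.next[clone] = dict(self.next[q])
                self.link[clone] = self.link[q]
                while p != -1 and self.next[p].get(tok) == q:
                    self.next[p][tok] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def draft(self, recent, max_draft=4):
        """Propose up to max_draft tokens that followed the longest matching suffix."""
        state, matched = 0, 0
        for tok in recent:
            # Shorten the current match until it can be extended by tok.
            while state != 0 and tok not in self.next[state]:
                state = self.link[state]
                matched = self.length[state]
            if tok in self.next[state]:
                state = self.next[state][tok]
                matched += 1
            else:
                matched = 0
        if matched == 0:
            return []
        start = self.first_end[state] + 1
        return self.tokens[start:start + max_draft]


sam = SuffixAutomaton()
for token in "the cat sat on the mat".split():
    sam.extend(token)

# The suffix "sat on" occurs in the reference, followed by "the mat".
print(sam.draft(["dog", "sat", "on"]))  # ['the', 'mat']
```

In a speculative-decoding loop, the drafted tokens would then be checked by the target model in a single forward pass and accepted only up to the first mismatch; how AgentSAM performs that step is not stated in the visible part of the abstract.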

Taxonomy

Topics: Multimodal Machine Learning Applications · Multi-Agent Systems and Negotiation · Reinforcement Learning in Robotics