Benchmark Test-Time Scaling of General LLM Agents
Xiaochuan Li, Ryan Ming, Pranav Setlur, Abhijay Paladugu, Andy Tang, Hao Kang, Shuai Shao, Rong Jin, Chenyan Xiong

TL;DR
This paper introduces General AgentBench, a comprehensive benchmark for evaluating general-purpose LLM agents across multiple domains, revealing significant performance challenges and limitations of current scaling methods.
Contribution
The paper presents a new unified benchmark for testing general LLM agents and systematically analyzes their scaling behaviors and limitations in realistic, multi-skill environments.
Findings
Performance drops when moving from domain-specific to general settings
Neither sequential nor parallel scaling improves performance effectively
Fundamental limitations include context ceiling and verification gap
Abstract
LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsWeb Data Mining and Analysis · Topic Modeling · Multi-Agent Systems and Negotiation
