Benchmark Test-Time Scaling of General LLM Agents

Xiaochuan Li; Ryan Ming; Pranav Setlur; Abhijay Paladugu; Andy Tang; Hao Kang; Shuai Shao; Rong Jin; Chenyan Xiong

arXiv:2602.18998 · cs.AI · February 24, 2026 · 2 cites

Open Access

TL;DR

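This paper introduces General AgentBench, a unified benchmark for evaluating general-purpose LLM agents across search, coding, reasoning, and tool-use domains. Ten leading agents perform substantially worse on it than in domain-specific evaluations, and neither sequential nor parallel test-time scaling effectively improves performance.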

Contribution

The paper presents a new unified benchmark for testing general LLM agents and systematically analyzes their scaling behaviors and limitations in realistic, multi-skill environments.

Findings

01. Performance drops when moving from domain-specific to general settings

02. Neither sequential nor parallel scaling improves performance effectively

03. Fundamental limitations include context ceiling and verification gap

Abstract

LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting.…
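
The two test-time scaling modes named in the abstract can be illustrated with a toy harness. This is a minimal sketch under stated assumptions, not the benchmark's code: `toy_agent`, its success probabilities, and every function name below are hypothetical stand-ins invented for illustration. Sequential scaling is modeled as giving one trajectory a larger interaction budget, and parallel scaling as sampling several independent trajectories at a fixed budget.

```python
import random
from typing import Callable, List

# Hypothetical agent interface: (rng, max_steps) -> did the trajectory succeed?
Agent = Callable[[random.Random, int], bool]


def toy_agent(rng: random.Random, max_steps: int) -> bool:
    """Illustrative stand-in for an agent rollout; the probabilities are invented."""
    p_success = min(0.1 * max_steps, 0.6)  # assumed: gains from extra steps saturate
    return rng.random() < p_success


def sequential_scaling(agent: Agent, seed: int, step_budgets: List[int]) -> List[bool]:
    """One trajectory per budget: more interaction steps, same number of samples."""
    return [agent(random.Random(seed), budget) for budget in step_budgets]


def parallel_scaling(agent: Agent, seed: int, k: int, max_steps: int) -> List[bool]:
    """k independent trajectories sampled at a fixed step budget."""
    rng = random.Random(seed)
    return [agent(rng, max_steps) for _ in range(k)]


def pass_at_k(outcomes: List[bool]) -> bool:
    """Oracle view of parallel scaling: solved if any sampled trajectory succeeded."""
    return any(outcomes)


if __name__ == "__main__":
    print(sequential_scaling(toy_agent, seed=1, step_budgets=[1, 2, 4, 8]))
    samples = parallel_scaling(toy_agent, seed=1, k=4, max_steps=4)
    print(samples, pass_at_k(samples))
```

Note that `pass_at_k` assumes access to ground-truth outcomes. A deployed agent would instead have to pick one trajectory without that oracle, which is a separate selection problem this sketch does not model.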

Taxonomy

Topics: Web Data Mining and Analysis · Topic Modeling · Multi-Agent Systems and Negotiation