CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents

Jiayu Liu; Cheng Qian; Zhaochen Su; Qing Zong; Shijue Huang; Bingxiang He; Yi R. Fung

arXiv:2511.02734·cs.AI·April 6, 2026

TL;DR

CostBench is a new benchmark for evaluating LLM agents' ability to plan cost-effectively and adapt to dynamic, unpredictable environments, highlighting current limitations in cost-aware reasoning.

Contribution

The paper introduces CostBench, a scalable, cost-centric benchmark with dynamic challenges to assess and improve LLM agents' economic reasoning and replanning skills.

Findings

01

Agents often fail to find cost-optimal solutions in static settings.

02

GPT-5 achieves less than 75% accuracy on complex tasks.

03

Performance drops by around 40% under dynamic conditions.

Abstract

Current evaluations of Large Language Model (LLM) agents primarily emphasize task completion, often overlooking resource efficiency and adaptability. This neglects a crucial capability: agents' ability to devise and adjust cost-optimal plans in response to changing environments. To bridge this gap, we introduce CostBench, a scalable, cost-centric benchmark designed to evaluate agents' economic reasoning and replanning abilities. Situated in the travel-planning domain, CostBench comprises tasks solvable via multiple sequences of atomic and composite tools with diverse, customizable costs. It also supports four types of dynamic blocking events, such as tool failures and cost changes, to simulate real-world unpredictability and require agents to adapt in real time. Evaluating leading open-source and proprietary models on CostBench reveals a substantial gap in cost-aware planning:…
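To make the setting in the abstract concrete, here is a minimal sketch of cost-optimal planning over atomic and composite tools, with replanning after a tool failure or a cost change. Everything in it is an assumption made for illustration: the tool names, states, costs, and the helpers `cheapest_plan` and `with_cost` are hypothetical and are not taken from the CostBench code or dataset. Dijkstra's algorithm stands in for whatever reasoning an agent actually performs.

```python
import heapq

# Hypothetical toy tool set (NOT from CostBench): (name, input_state, output_state, cost).
TOOLS = [
    ("search_flights", "start", "flights_found", 3.0),          # atomic
    ("book_flight", "flights_found", "flight_booked", 5.0),     # atomic
    ("search_and_book_flight", "start", "flight_booked", 7.0),  # composite
    ("book_hotel", "flight_booked", "trip_planned", 4.0),       # atomic
]


def cheapest_plan(tools, start, goal, banned=frozenset()):
    """Dijkstra over task states; returns (total_cost, [tool names]) or None."""
    frontier = [(0.0, start, [])]
    best = {start: 0.0}
    while frontier:
        cost, state, plan = heapq.heappop(frontier)
        if state == goal:
            return cost, plan
        if cost > best.get(state, float("inf")):
            continue  # stale queue entry; a cheaper route to this state exists
        for name, src, dst, tool_cost in tools:
            if src != state or name in banned:
                continue
            new_cost = cost + tool_cost
            if new_cost < best.get(dst, float("inf")):
                best[dst] = new_cost
                heapq.heappush(frontier, (new_cost, dst, plan + [name]))
    return None  # goal unreachable with the available tools


def with_cost(tools, tool_name, new_cost):
    """Return a copy of the tool list with one tool's cost changed."""
    return [(n, s, d, new_cost if n == tool_name else c) for n, s, d, c in tools]


# Static setting: the composite tool (7) beats the two atomic tools (3 + 5).
static_plan = cheapest_plan(TOOLS, "start", "trip_planned")

# Dynamic event, tool failure: the composite tool becomes unavailable.
after_failure = cheapest_plan(
    TOOLS, "start", "trip_planned", banned={"search_and_book_flight"}
)

# Dynamic event, cost change: the composite tool's cost rises from 7 to 9.
after_price_change = cheapest_plan(
    with_cost(TOOLS, "search_and_book_flight", 9.0), "start", "trip_planned"
)

print(static_plan)
print(after_failure)
print(after_price_change)
```

In this toy instance the cheapest static plan uses the composite tool, while both simulated events push the optimum back to the atomic sequence. An agent that keeps following its original plan after such an event would either fail or overpay, which is the kind of behavior a cost-centric benchmark with blocking events is meant to expose.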

Code & Models

Repositories

jiayujeff/CostBench
github

Datasets

JiayuJeff/CostBench
dataset · 85 dl
