$C^3$-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking

Peijie Yu; Yifan Yang; Jinjian Li; Zelong Zhang; Haorui Wang; Xiao Feng; Feng Zhang

arXiv:2505.18746 · cs.AI · June 30, 2025

TL;DR

This paper introduces $C^3$-Bench, a comprehensive benchmark designed to evaluate large language model-based agents' robustness in multi-tasking environments, focusing on complex tool interactions, hidden information, and dynamic decision-making.

Contribution

The paper presents a new benchmark, $C^3$-Bench, with challenges and metrics to assess and analyze the robustness of LLM-based agents in complex, multi-tasking scenarios.

Findings

01 Agents struggle with tool dependency management

02 Long context handling is a significant challenge

03 Frequent policy switching indicates robustness issues

Abstract

Agents based on large language models leverage tools to modify environments, revolutionizing how AI interacts with the physical world. Unlike traditional NLP tasks that rely solely on historical dialogue for responses, these agents must consider more complex factors, such as inter-tool relationships, environmental feedback, and previous decisions, when making choices. Current research typically evaluates agents via multi-turn dialogues but overlooks the influence of these critical factors on agent behavior. To bridge this gap, we present $C^{3}$-Bench, an open-source, high-quality benchmark. It integrates attack concepts and applies univariate analysis to pinpoint key elements affecting agent robustness. Concretely, we design three challenges: navigating complex tool relationships, handling critical hidden information, and managing dynamic decision paths. Complementing…

Code & Models

Datasets

tencent/C3-BenchMark
dataset· 284 dl
