$C^3$-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking
Peijie Yu, Yifan Yang, Jinjian Li, Zelong Zhang, Haorui Wang, Xiao Feng, Feng Zhang

TL;DR
This paper introduces $C^3$-Bench, a comprehensive benchmark designed to evaluate large language model-based agents' robustness in multi-tasking environments, focusing on complex tool interactions, hidden information, and dynamic decision-making.
Contribution
The paper presents a new benchmark, $C^3$-Bench, with challenges and metrics to assess and analyze the robustness of LLM-based agents in complex, multi-tasking scenarios.
Findings
Agents struggle with tool dependency management
Long context handling is a significant challenge
Frequent policy switching indicates robustness issues
Abstract
Agents based on large language models leverage tools to modify environments, revolutionizing how AI interacts with the physical world. Unlike traditional NLP tasks that rely solely on historical dialogue for responses, these agents must consider more complex factors, such as inter-tool relationships, environmental feedback, and previous decisions, when making choices. Current research typically evaluates agents via multi-turn dialogues. However, it overlooks the influence of these critical factors on agent behavior. To bridge this gap, we present an open-source, high-quality benchmark, $C^3$-Bench. This benchmark integrates attack concepts and applies univariate analysis to pinpoint key elements affecting agent robustness. Concretely, we design three challenges: navigating complex tool relationships, handling critical hidden information, and managing dynamic decision paths. Complementing…
