The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs
Avinash Baidya, Kamalika Das, Xiang Gao

TL;DR
This paper investigates the behavioral differences between LLM-based agents and humans in complex task-oriented dialogs, revealing that reducing these gaps significantly improves agent performance, especially as task complexity increases.
Contribution
It introduces a comprehensive evaluation framework to quantify the behavior gap in LLM agents and demonstrates its impact on performance in complex dialogs.
Findings
The behavior gap correlates strongly with task complexity (correlation: 0.963)
LLM agents show low alignment with human experts on dialog acts and tool usage in complex tasks
Agent performance improves by 24.3% when behavior gaps are reduced
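To make the first finding concrete, the following is a minimal sketch of how a correlation between task complexity and a behavior gap score could be computed. It is not the paper's method: this summary does not say which correlation coefficient or gap formula the authors use, so the sketch assumes a Pearson correlation and a hypothetical gap defined as one minus the mean alignment across the three dimensions named in the abstract. All numbers are made-up placeholders, not results from the paper.

```python
import math


def behavior_gap(alignment_scores):
    """Hypothetical gap score: 1 minus the mean alignment (each in [0, 1]).

    This formula is an assumption for illustration, not the paper's metric.
    """
    return 1.0 - sum(alignment_scores.values()) / len(alignment_scores)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder data: (assumed complexity rank, assumed alignment per dimension).
tasks = [
    (1, {"dialog_acts": 0.80, "tool_usage": 0.85, "knowledge": 0.90}),
    (2, {"dialog_acts": 0.65, "tool_usage": 0.70, "knowledge": 0.80}),
    (3, {"dialog_acts": 0.40, "tool_usage": 0.45, "knowledge": 0.60}),
]

complexities = [rank for rank, _ in tasks]
gaps = [behavior_gap(scores) for _, scores in tasks]
print(f"correlation between complexity and gap: {pearson(complexities, gaps):.3f}")
```

A coefficient near 1 in such a setup would indicate that the gap grows with complexity, which is the relationship the finding describes.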
Abstract
Large Language Model (LLM)-based agents have significantly impacted Task-Oriented Dialog Systems (TODS) but continue to face notable performance challenges, especially in zero-shot scenarios. While prior work has noted this performance gap, the behavioral factors driving the performance gap remain under-explored. This study proposes a comprehensive evaluation framework to quantify the behavior gap between AI agents and human experts, focusing on discrepancies in dialog acts, tool usage, and knowledge utilization. Our findings reveal that this behavior gap is a critical factor negatively impacting the performance of LLM agents. Notably, as task complexity increases, the behavior gap widens (correlation: 0.963), leading to a degradation of agent performance on complex task-oriented dialogs. For the most complex task in our study, even the GPT-4o-based agent exhibits low alignment with…
Taxonomy
Topics: Topic Modeling · Multi-Agent Systems and Negotiation
