The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs

Avinash Baidya; Kamalika Das; Xiang Gao

arXiv:2506.12266 · cs.CL · June 17, 2025

TL;DR

This paper investigates the behavioral differences between LLM-based agents and humans in complex task-oriented dialogs, revealing that reducing these gaps significantly improves agent performance, especially as task complexity increases.

Contribution

The paper introduces a comprehensive evaluation framework to quantify the behavior gap between LLM agents and human experts, and demonstrates the gap's impact on agent performance in complex dialogs.

Findings

01 Behavior gap correlates strongly with task complexity (correlation: 0.963)
02 Low alignment scores for dialog acts and tool usage in complex tasks
03 Performance improves by 24.3% when behavior gaps are reduced

Abstract

Large Language Model (LLM)-based agents have significantly impacted Task-Oriented Dialog Systems (TODS) but continue to face notable performance challenges, especially in zero-shot scenarios. While prior work has noted this performance gap, the behavioral factors driving the performance gap remain under-explored. This study proposes a comprehensive evaluation framework to quantify the behavior gap between AI agents and human experts, focusing on discrepancies in dialog acts, tool usage, and knowledge utilization. Our findings reveal that this behavior gap is a critical factor negatively impacting the performance of LLM agents. Notably, as task complexity increases, the behavior gap widens (correlation: 0.963), leading to a degradation of agent performance on complex task-oriented dialogs. For the most complex task in our study, even the GPT-4o-based agent exhibits low alignment with…
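
The abstract describes quantifying a behavior gap and correlating it with task complexity, but this excerpt does not give the paper's actual metrics. The following is a minimal sketch under stated assumptions: alignment is taken to be a set-overlap F1 between the agent's and the human expert's labels per turn (dialog acts here, though tool calls would work the same way), the gap is one minus mean alignment, and the correlation is Pearson's r. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from math import sqrt


def f1_alignment(agent_items, human_items):
    """Set-overlap F1 between agent and human labels for one turn.

    Assumption: F1 is used purely for illustration; the paper's own
    alignment metric is not specified in this excerpt.
    """
    agent, human = set(agent_items), set(human_items)
    if not agent and not human:
        return 1.0  # both sides did nothing: treat as full agreement
    overlap = len(agent & human)
    if overlap == 0:
        return 0.0
    precision = overlap / len(agent)
    recall = overlap / len(human)
    return 2 * precision * recall / (precision + recall)


def behavior_gap(agent_turns, human_turns):
    """Hypothetical gap score: 1 minus mean per-turn alignment."""
    scores = [f1_alignment(a, h) for a, h in zip(agent_turns, human_turns)]
    return 1.0 - sum(scores) / len(scores)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy, made-up dialog-act labels for a two-turn dialog (not paper data).
agent_turns = [["inform", "request"], ["confirm"]]
human_turns = [["inform"], ["confirm", "offer"]]
gap = behavior_gap(agent_turns, human_turns)
```

With one gap score per task and a numeric complexity value per task, `pearson(complexities, gaps)` would yield a coefficient comparable in kind to the 0.963 reported above, although how the paper measures complexity is not stated here.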

Taxonomy

Topics: Topic Modeling · Multi-Agent Systems and Negotiation