Drift-Bench: Diagnosing Cooperative Breakdowns in LLM Agents under Input Faults via Multi-Turn Interaction

Han Bao; Zheyuan Zhang; Pengcheng Jing; Zhengqing Yuan; Kaiwen Shi; Yanfang Ye

arXiv:2602.02455·cs.AI·February 3, 2026

TL;DR

Drift-Bench is a new benchmark for evaluating how large language model agents handle input faults through multi-turn clarification, revealing significant performance drops and aiding in diagnosing safety-critical failures.

Contribution

The paper introduces the first diagnostic benchmark for multi-turn clarification under input faults in LLM agents, grounded in communication theory and employing a persona-driven user simulator.

Findings

1. Performance drops under input faults are substantial.

2. Clarification effectiveness varies across user personas and fault types.

3. The benchmark bridges clarification research and agent safety evaluation.

Abstract

As Large Language Models transition to autonomous agents, user inputs frequently violate cooperative assumptions (e.g., implicit intent, missing parameters, false presuppositions, or ambiguous expressions), creating execution risks that text-only evaluations do not capture. Existing benchmarks typically assume well-specified instructions or restrict evaluation to text-only, single-turn clarification, and thus do not measure multi-turn disambiguation under grounded execution risk. We introduce Drift-Bench, the first diagnostic benchmark that evaluates agentic pragmatics under input faults through multi-turn clarification across state-oriented and service-oriented execution environments. Grounded in classical theories of communication, Drift-Bench provides a unified taxonomy of cooperative breakdowns and employs a persona-driven user simulator with the Rise…

Taxonomy

Topics: Persona Design and Applications · Topic Modeling · Ethics and Social Impacts of AI