ABTest: Behavior-Driven Testing for AI Coding Agents
Wuyang Dai, Moses Openja, Hung Viet Pham, Gias Uddin, Jinqiu Yang, Song Wang

TL;DR
ABTest is a behavior-driven fuzzing framework that systematically tests AI coding agents using real-world failure reports to uncover robustness issues and new failure modes.
Contribution
We introduce ABTest, a novel framework that transforms user-reported failures into behavioral tests to evaluate and improve AI coding agents' robustness.
Findings
ABTest generated 647 fuzzing cases from 400 failure reports.
ABTest detected 1,573 behavioral anomalies across three AI coding agents.
Of these, 642 were confirmed as new true anomalies, a detection precision of 40.8%.
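The reported precision follows from the two counts above, assuming it is defined as confirmed true anomalies divided by all detected anomalies. The variable names in this short check are illustrative only:

```python
# Counts taken from the findings above.
detected_anomalies = 1573   # anomalies flagged across the three agents
confirmed_anomalies = 642   # flagged anomalies confirmed as true

# Assumed definition: precision = confirmed / detected.
precision = confirmed_anomalies / detected_anomalies
precision_text = f"{precision:.1%}"
print(precision_text)  # 40.8%
```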
Abstract
AI coding agents are increasingly integrated into real-world software development workflows, yet their robustness under diverse and adversarial scenarios remains poorly understood. We present ABTest, a behavior-driven fuzzing framework that systematically tests coding agents by turning real-world failure reports into repository-grounded behavioral tests. ABTest (1) mines user-reported anomalies to derive reusable workflow patterns (Interaction Patterns) and behaviors (Action types); (2) composes them into stepwise fuzzing templates; (3) instantiates executable test cases in real repositories; (4) executes them with coding agents while recording traces and artifacts; and (5) detects and validates anomalous behaviors. We apply ABTest to three widely used coding agents: Claude Code, OpenAI Codex CLI, and Gemini CLI. From 400 user-reported developer-confirmed agent failures, we extract 47…
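The five stages listed in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: every name here (`Template`, `compose`, `instantiate`, `run_case`, the stub agent, the anomaly check) is hypothetical, and pairing every Interaction Pattern with every Action type is an assumption made only to keep the example small.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Callable


@dataclass(frozen=True)
class Template:
    """A stepwise fuzzing template (stage 2), hypothetical shape."""
    pattern: str             # Interaction Pattern name (workflow shape)
    action: str              # Action type exercised in each step
    steps: tuple[str, ...]   # abstract steps, not yet tied to a repository


@dataclass
class Result:
    """Trace and verdict for one executed test case (stages 4 and 5)."""
    case: str
    trace: list[str] = field(default_factory=list)
    anomalous: bool = False


def compose(patterns: dict[str, list[str]], actions: list[str]) -> list[Template]:
    # Assumption: combine each mined pattern with each mined action type.
    return [
        Template(name, action, tuple(f"{step} [{action}]" for step in steps))
        for (name, steps), action in product(patterns.items(), actions)
    ]


def instantiate(template: Template, repo: str) -> list[str]:
    # Stage 3: ground each abstract step in a concrete repository.
    return [f"In {repo}: {step}" for step in template.steps]


def run_case(template: Template, repo: str,
             agent: Callable[[str], str],
             is_anomalous: Callable[[list[str]], bool]) -> Result:
    # Stage 4: send each step to the agent and record its response.
    result = Result(case=f"{template.pattern}/{template.action}@{repo}")
    for prompt in instantiate(template, repo):
        result.trace.append(agent(prompt))
    # Stage 5: apply a detection rule to the recorded trace.
    result.anomalous = is_anomalous(result.trace)
    return result


# Usage with stand-ins: a stub agent and a trivial anomaly rule.
templates = compose({"edit-then-verify": ["modify file", "run tests"]},
                    ["rename", "delete"])
results = [
    run_case(t, "example/repo",
             agent=lambda prompt: f"done: {prompt}",
             is_anomalous=lambda trace: any("error" in line for line in trace))
    for t in templates
]
```

In a real setup the stub agent would be replaced by a call into a coding agent, and the anomaly rule by whatever detection and validation procedure the framework applies; neither is specified here.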
