ABTest: Behavior-Driven Testing for AI Coding Agents

Wuyang Dai; Moses Openja; Hung Viet Pham; Gias Uddin; Jinqiu Yang; Song Wang

arXiv:2604.03362 · cs.SE · April 23, 2026

TL;DR

ABTest is a behavior-driven fuzzing framework that systematically tests AI coding agents using real-world failure reports to uncover robustness issues and new failure modes.

Contribution

We introduce ABTest, a novel framework that transforms user-reported failures into behavioral tests to evaluate and improve AI coding agents' robustness.

Findings

01

ABTest generated 647 fuzzing cases from 400 failure reports.

02

ABTest detected 1,573 behavioral anomalies across three AI coding agents.

03

Of these, 642 were confirmed as new true anomalies, for a detection precision of 40.8%.
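
The reported precision is consistent with the two counts above if precision is taken to be confirmed anomalies divided by detected anomalies. That definition is an assumption here, since the listing does not spell it out. A minimal check:

```python
# Counts taken from the findings above.
DETECTED_ANOMALIES = 1573
CONFIRMED_ANOMALIES = 642


def detection_precision(confirmed: int, detected: int) -> float:
    """Fraction of detected anomalies that were confirmed as true anomalies.

    Assumes precision = confirmed / detected; the source does not state the formula.
    """
    if detected <= 0:
        raise ValueError("detected must be positive")
    return confirmed / detected


precision = detection_precision(CONFIRMED_ANOMALIES, DETECTED_ANOMALIES)
print(f"{precision:.1%}")
```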

Abstract

AI coding agents are increasingly integrated into real-world software development workflows, yet their robustness under diverse and adversarial scenarios remains poorly understood. We present ABTest, a behavior-driven fuzzing framework that systematically tests coding agents by turning real-world failure reports into repository-grounded behavioral tests. ABTest (1) mines user-reported anomalies to derive reusable workflow patterns (Interaction Patterns) and behaviors (Action types); (2) composes them into stepwise fuzzing templates; (3) instantiates executable test cases in real repositories; (4) executes them with coding agents while recording traces and artifacts; and (5) detects and validates anomalous behaviors. We apply ABTest to three widely used coding agents: Claude Code, OpenAI Codex CLI, and Gemini CLI. From 400 user-reported developer-confirmed agent failures, we extract 47…
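
Steps (2) and (3) of the pipeline described above can be pictured with a small sketch. Everything in it is hypothetical: the class names, the `{action}` and `{target}` placeholders, the example pattern and actions, and the choice to pair every pattern with every action are illustrative assumptions, not the ABTest implementation.

```python
from dataclasses import dataclass
from itertools import product

# All names here are hypothetical; none are taken from the ABTest codebase.


@dataclass(frozen=True)
class InteractionPattern:
    name: str
    steps: tuple  # ordered step texts; may contain "{action}" and "{target}"


@dataclass(frozen=True)
class ActionType:
    name: str
    verb: str  # phrase substituted for "{action}"


@dataclass(frozen=True)
class FuzzTemplate:
    pattern: str
    action: str
    prompts: tuple  # stepwise prompts with "{target}" still unfilled


def compose_templates(patterns, actions):
    """Pair each pattern with each action (an assumed strategy) and fill the action slot."""
    templates = []
    for pattern, action in product(patterns, actions):
        prompts = tuple(s.replace("{action}", action.verb) for s in pattern.steps)
        templates.append(FuzzTemplate(pattern.name, action.name, prompts))
    return templates


def instantiate(template, repo_path, target_file):
    """Ground a template in a concrete repository by filling the target slot."""
    return {
        "repo": repo_path,
        "pattern": template.pattern,
        "action": template.action,
        "steps": [p.replace("{target}", target_file) for p in template.prompts],
    }


patterns = [
    InteractionPattern(
        "edit-then-verify",
        ("{action} {target}", "run the tests and report the result"),
    )
]
actions = [
    ActionType("rename", "rename a function in"),
    ActionType("delete", "delete unused code from"),
]

templates = compose_templates(patterns, actions)
case = instantiate(templates[0], "/tmp/example-repo", "src/utils.py")
for step in case["steps"]:
    print(step)
```

Each resulting case would then be handed to a coding agent while traces and artifacts are recorded, which this sketch leaves out.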
