Test-Driven AI Agent Definition (TDAD): Compiling Tool-Using Agents from Behavioral Specifications

Tzafrir Rehan

arXiv:2603.08806 · cs.SE · March 11, 2026

TL;DR

TDAD introduces a test-driven methodology for defining and refining tool-using AI agents through behavioral specifications, improving compliance, detecting regressions, and ensuring safety in deployment.

Contribution

It presents a novel approach combining behavioral specifications with automated testing and mutation analysis to enhance AI agent reliability and safety.

Findings

01. Achieved a 92% compilation success rate on benchmark agents.

02. Demonstrated high mutation detection scores (86-100%).

03. Showed improved regression safety, with 97% safety scores.

Abstract

We present Test-Driven AI Agent Definition (TDAD), a methodology that treats agent prompts as compiled artifacts: engineers provide behavioral specifications, a coding agent converts them into executable tests, and a second coding agent iteratively refines the prompt until tests pass. Deploying tool-using LLM agents in production requires measurable behavioral compliance that current development practices cannot provide. Small prompt changes cause silent regressions, tool misuse goes undetected, and policy violations emerge only after deployment. To mitigate specification gaming, TDAD introduces three mechanisms: (1) visible/hidden test splits that withhold evaluation tests during compilation, (2) semantic mutation testing via a post-compilation agent that generates plausible faulty prompt variants, with the harness measuring whether the test suite detects them, and (3) spec evolution…
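
To make the first two mechanisms concrete, here is a minimal sketch of a visible/hidden test split and a mutation score. It is not the paper's harness. The names `split_tests`, `pass_rate`, and `mutation_score` are hypothetical, and the toy tests check for substrings in the prompt text, whereas a real TDAD-style test would presumably run the agent and inspect its tool calls and replies.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Assumption: a "test" is any callable that takes a prompt and returns True on pass.
# Real behavioral tests would execute an agent; these are stand-ins.
Test = Callable[[str], bool]


def split_tests(
    tests: Sequence[Test], hidden_fraction: float, seed: int = 0
) -> Tuple[List[Test], List[Test]]:
    """Deterministically shuffle tests and withhold a fraction as hidden."""
    shuffled = list(tests)
    random.Random(seed).shuffle(shuffled)
    n_hidden = int(round(len(shuffled) * hidden_fraction))
    return shuffled[n_hidden:], shuffled[:n_hidden]  # (visible, hidden)


def pass_rate(prompt: str, tests: Sequence[Test]) -> float:
    """Fraction of tests the given prompt passes."""
    if not tests:
        raise ValueError("need at least one test")
    return sum(1 for t in tests if t(prompt)) / len(tests)


def mutation_score(mutants: Sequence[str], tests: Sequence[Test]) -> float:
    """Fraction of mutant prompts that fail at least one test (are detected)."""
    if not mutants:
        raise ValueError("need at least one mutant")
    detected = sum(1 for m in mutants if not all(t(m) for t in tests))
    return detected / len(mutants)


# Hypothetical compiled prompt and toy tests, for illustration only.
COMPILED = "verify identity; never reveal card numbers; escalate disputes"
TESTS: List[Test] = [
    lambda p: "verify identity" in p,
    lambda p: "never reveal card numbers" in p,
    lambda p: "escalate disputes" in p,
]
# Each mutant silently drops one required behavior from the prompt.
MUTANTS = [
    COMPILED.replace("verify identity; ", ""),
    COMPILED.replace("never reveal card numbers; ", ""),
    COMPILED.replace("; escalate disputes", ""),
]

if __name__ == "__main__":
    visible, hidden = split_tests(TESTS, hidden_fraction=1 / 3)
    print("visible pass rate:", pass_rate(COMPILED, visible))
    print("hidden pass rate:", pass_rate(COMPILED, hidden))
    print("mutation score:", mutation_score(MUTANTS, TESTS))
```

In this toy setup the full suite detects every mutant, while a suite containing only the first test would detect one of three. That gap is the kind of weakness a mutation score is meant to expose.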

Code & Models

Datasets

f-labs-io/SpecSuite-Core
dataset · 22 downloads

Taxonomy

Topics: Software Testing and Debugging Techniques · Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI