Automated structural testing of LLM-based agents: methods, framework, and case studies

Jens Kohl; Otto Kruse; Youssef Mostafa; Andre Luckow; Karsten Schroer; Thomas Riedl; Ryan French; David Katz; Manuel P. Luitz; Tanrajbir Takher; Ken E. Friedl; Céline Laurent-Winter

arXiv:2601.18827 · cs.SE · January 28, 2026

Open Access

TL;DR

This paper introduces a structural testing framework for LLM-based agents using traces, mocking, and assertions, enabling automated, in-depth testing and faster root-cause analysis to improve agent quality.

Contribution

The paper presents novel methods for automated structural testing of LLM-based agents, integrating software engineering practices into agent testing workflows.

Findings

01. Automated testing reduces costs and increases coverage.

02. Trace-based debugging speeds up root-cause analysis.

03. Defects are detected earlier in development.

Abstract

LLM-based agents are rapidly being adopted across diverse domains. Since they interact with users without supervision, they must be tested extensively. Current testing approaches focus on acceptance-level evaluation from the user's perspective. While intuitive, these tests require manual evaluation, are difficult to automate, do not facilitate root cause analysis, and require expensive test environments. In this paper, we present methods to enable structural testing of LLM-based agents. Our approach utilizes traces (based on OpenTelemetry) to capture agent trajectories, employs mocking to enforce reproducible LLM behavior, and adds assertions to automate test verification. This enables testing agent components and interactions at a deeper technical level within automated workflows. We demonstrate how structural testing enables the adaptation of software engineering best practices to…
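
The three ingredients named in the abstract (traces, mocking, assertions) can be illustrated with a small Python sketch. This is not the paper's framework: the `Span` and `Tracer` classes are simplified in-memory stand-ins for OpenTelemetry spans, and `MockLLM`, `get_weather`, and `run_agent` are hypothetical names invented for this example. The point is only to show how a scripted LLM makes a run reproducible and how assertions can check the recorded trajectory step by step.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Simplified stand-in for an OpenTelemetry span (assumption, not the real API)."""
    name: str
    attributes: dict = field(default_factory=dict)


class Tracer:
    """Collects spans in memory so a test can inspect the agent trajectory."""

    def __init__(self):
        self.spans = []

    def record(self, name, **attributes):
        self.spans.append(Span(name, attributes))


class MockLLM:
    """Returns scripted responses in order, making the agent run reproducible."""

    def __init__(self, responses):
        self._responses = list(responses)

    def complete(self, prompt):
        return self._responses.pop(0)


def get_weather(city):
    """Hypothetical tool; a real agent would call an external service here."""
    return f"Sunny in {city}"


def run_agent(question, llm, tracer):
    """Toy agent: ask the LLM for a tool call, run the tool, ask for an answer."""
    decision = llm.complete(question)
    tracer.record("llm.call", prompt=question, output=decision)
    tool_name, _, argument = decision.partition(":")
    result = get_weather(argument) if tool_name == "get_weather" else "unknown tool"
    tracer.record("tool.call", tool=tool_name, argument=argument, result=result)
    answer = llm.complete(result)
    tracer.record("llm.call", prompt=result, output=answer)
    return answer


def test_agent_trajectory():
    tracer = Tracer()
    llm = MockLLM(["get_weather:Munich", "It is sunny in Munich."])
    answer = run_agent("What is the weather in Munich?", llm, tracer)

    # Structural assertions: check the steps taken, not just the final answer.
    assert [s.name for s in tracer.spans] == ["llm.call", "tool.call", "llm.call"]
    assert tracer.spans[1].attributes["tool"] == "get_weather"
    assert tracer.spans[1].attributes["argument"] == "Munich"
    assert answer == "It is sunny in Munich."
    return tracer.spans


spans = test_agent_trajectory()
```

Because the assertions target individual spans, a failure points at the step that went wrong (for example, a wrong tool argument) rather than only at an unexpected final answer. In a real setup the in-memory `Tracer` would presumably be replaced by spans exported through OpenTelemetry.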

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software Engineering Techniques and Practices