TAI3: Testing Agent Integrity in Interpreting User Intent

Shiwei Feng; Xiangzhe Xu; Xuan Chen; Kaiyuan Zhang; Syed Yusuf Ahmed; Zian Su; Mingwei Zheng; Xiangyu Zhang

arXiv:2506.07524 · cs.SE · October 27, 2025

TL;DR

TAI3 is a novel API-centric stress testing framework that systematically uncovers intent integrity violations in LLM agents by generating and mutating realistic tasks based on toolkit documentation.

Contribution

It introduces semantic partitioning and a datatype-aware strategy memory to make testing LLM agents for intent violations more efficient and effective.

Findings

1. Effectively uncovers intent violations in 80 toolkit APIs.

2. Outperforms baselines in error detection rate and query efficiency.

3. Generalizes well to different models and evolving APIs.

Abstract

LLM agents are increasingly deployed to automate real-world tasks by invoking APIs through natural language instructions. While powerful, they often misinterpret user intent, producing actions that diverge from the user's intended goal, especially as external toolkits evolve. Traditional software testing assumes structured inputs and thus falls short in handling the ambiguity of natural language. We introduce TAI3, an API-centric stress testing framework that systematically uncovers intent integrity violations in LLM agents. Unlike prior work focused on fixed benchmarks or adversarial inputs, TAI3 generates realistic tasks based on toolkits' documentation and applies targeted mutations to expose subtle agent errors while preserving user intent. To guide testing, we propose semantic partitioning, which organizes natural language tasks into meaningful…
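The general idea of intent-preserving mutation testing described in the abstract can be illustrated with a minimal sketch. This is not TAI3's implementation: the toy keyword-based agent, the `transfer` API, the two mutation functions, and all other names below are hypothetical stand-ins chosen for illustration. The sketch rewrites a seed task in ways that keep the user's intent fixed, then flags any rewrite for which the agent's API call differs from the expected one.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ApiCall:
    """An API invocation: name plus sorted (key, value) argument pairs."""
    name: str
    args: Tuple[Tuple[str, object], ...]


def call(name: str, **kwargs) -> ApiCall:
    return ApiCall(name, tuple(sorted(kwargs.items())))


def toy_agent(task: str) -> ApiCall:
    """Hypothetical stand-in for an LLM agent (assumption, not a real model).

    It only understands amounts written as digits, which is the kind of
    subtle weakness a mutation can expose.
    """
    words = task.lower().split()
    amount = next((int(w) for w in words if w.isdigit()), 0)
    return call("transfer", amount=amount)


# Hypothetical intent-preserving mutations: the wording changes,
# the user's goal (transfer 50 dollars) does not.
def add_politeness(task: str) -> str:
    return "Please " + task[0].lower() + task[1:]


def spell_out_number(task: str) -> str:
    return task.replace("50", "fifty")


def find_violations(
    agent: Callable[[str], ApiCall],
    seed_task: str,
    expected: ApiCall,
    mutations: List[Callable[[str], str]],
) -> List[Tuple[str, str, ApiCall]]:
    """Return (mutation name, mutated task, actual call) for each mismatch."""
    violations = []
    for mutate in mutations:
        mutated = mutate(seed_task)
        actual = agent(mutated)
        if actual != expected:
            violations.append((mutate.__name__, mutated, actual))
    return violations


seed = "Transfer 50 dollars to Bob"
expected_call = call("transfer", amount=50)
violations = find_violations(
    toy_agent, seed, expected_call, [add_politeness, spell_out_number]
)
for name, task, actual in violations:
    print(f"{name}: {task!r} -> {actual}")
```

In this toy setup the polite rephrasing passes, while spelling out the number makes the agent emit `amount=0`, which is reported as a violation because the user's intent was unchanged. A real harness would replace `toy_agent` with calls to an actual agent and derive the expected call from the toolkit documentation.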
