PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures
Yunan Lu, Luigi Liu, Omar Yahia, Arpit Sharma, Zhou Yu

TL;DR
PQR is a framework that generates diverse, realistic user queries to effectively identify failures in QA agents, reducing human effort and improving failure detection accuracy.
Contribution
It introduces an iterative query and prompt refinement process to surface realistic failure cases, outperforming prior automatic failure discovery methods.
Findings
Uncovers 23% to 78% more unhelpful responses from QA agents.
Generated queries are more diverse and realistic than previous methods.
Enhances failure detection with less human effort.
Abstract
Evaluating LLM-based agents remains challenging because identifying meaningful failure cases often requires substantial human effort to design realistic test scenarios. Prior work primarily focuses on automatically discovering agent failures induced by adversarial users, while overlooking queries with real user intents that also trigger agent failures. We introduce PQR, a framework that generates queries that not only surface agent failures with respect to specific objectives (e.g., helpfulness or safety) but also resemble real users' intents. PQR operates through an iterative interaction between two complementary modules. The query refinement module performs rewrites to explore diverse query variations, while the prompt refinement module uses prior feedback to derive new objective-violating strategies and realism policies for refining prompts, which in turn generate failure-triggering yet realistic…
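The interaction between the two modules described in the abstract can be pictured with a small, self-contained sketch. This is not the authors' implementation: every name here (Prompt, pqr_loop, toy_rewrite, toy_refine_prompt, toy_agent, and the two judge functions) is hypothetical, and the LLM-backed generator, agent, and judges are replaced by deterministic toy stand-ins so the control flow can be run as-is.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Feedback = Tuple[str, bool, bool]  # (query, agent_failed, looks_realistic)


@dataclass
class Prompt:
    """Hypothetical generator-prompt state; the field names are assumptions."""
    strategies: List[str] = field(default_factory=list)
    realism_policies: List[str] = field(default_factory=list)


def pqr_loop(
    seeds: List[str],
    rewrite: Callable[[str, Prompt], List[str]],
    refine_prompt: Callable[[Prompt, List[Feedback]], Prompt],
    agent: Callable[[str], str],
    is_failure: Callable[[str, str], bool],
    is_realistic: Callable[[str], bool],
    rounds: int = 2,
) -> List[str]:
    """Alternate query refinement and prompt refinement for a fixed number of rounds."""
    prompt = Prompt()
    failures: List[str] = []
    for _ in range(rounds):
        # Query refinement: rewrite each seed under the current prompt.
        candidates = [v for q in seeds for v in rewrite(q, prompt)]
        feedback: List[Feedback] = []
        for q in candidates:
            failed = is_failure(q, agent(q))
            realistic = is_realistic(q)
            feedback.append((q, failed, realistic))
            # Keep only queries that both trigger a failure and look realistic.
            if failed and realistic and q not in failures:
                failures.append(q)
        # Prompt refinement: update strategies/policies from this round's feedback.
        prompt = refine_prompt(prompt, feedback)
    return failures


# --- Toy stand-ins (assumptions, in place of LLM calls) ---
def toy_rewrite(query: str, prompt: Prompt) -> List[str]:
    return [query] + [f"{query} {s}" for s in prompt.strategies]


def toy_refine_prompt(prompt: Prompt, feedback: List[Feedback]) -> Prompt:
    new_strategy = "for last year's model"
    no_failures = not any(failed for _, failed, _ in feedback)
    if no_failures and new_strategy not in prompt.strategies:
        return Prompt(prompt.strategies + [new_strategy], prompt.realism_policies)
    return prompt


def toy_agent(query: str) -> str:
    return "I don't know." if "last year's" in query else "Here is how."


def toy_is_failure(query: str, response: str) -> bool:
    return response == "I don't know."


def toy_is_realistic(query: str) -> bool:
    return len(query.split()) <= 12


failures = pqr_loop(
    ["How do I reset my router"],
    toy_rewrite, toy_refine_prompt, toy_agent, toy_is_failure, toy_is_realistic,
)
print(failures)
```

In this toy run the first round finds no failure, so the prompt gains a new strategy; the second round's rewrite then produces a query that the toy agent cannot answer. A real system would replace each stand-in with model calls and a learned or rubric-based judge.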
