Beyond Testers' Biases: Guiding Model Testing with Knowledge Bases using LLMs
Chenyang Yang, Rishabh Rustogi, Rachel Brower-Sinning, Grace A. Lewis, Christian Kästner, Tongshuang Wu

TL;DR
Weaver is an interactive tool leveraging large language models to generate knowledge bases and guide model testing, helping testers identify diverse and nuanced test cases beyond their biases.
Contribution
We introduce Weaver, a novel interactive system that uses LLMs to support requirements elicitation and diversify model testing through knowledge base generation.
Findings
Testers identified more concepts, and more diverse ones, when using Weaver.
Collectively, testers found more than 200 failing test cases for stance detection with zero-shot ChatGPT.
Weaver aids real-world model testing in various application scenarios.
Abstract
Current model testing work has mostly focused on creating test cases. Identifying what to test is a step that is largely ignored and poorly supported. We propose Weaver, an interactive tool that supports requirements elicitation for guiding model testing. Weaver uses large language models to generate knowledge bases and recommends concepts from them interactively, allowing testers to elicit requirements for further testing. Weaver provides rich external knowledge to testers and encourages testers to systematically explore diverse concepts beyond their own biases. In a user study, we show that both NLP experts and non-experts identified more, as well as more diverse, concepts worth testing when using Weaver. Collectively, they found more than 200 failing test cases for stance detection with zero-shot ChatGPT. Our case studies further show that Weaver can help practitioners test models in…
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software System Performance and Reliability
