MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems
Guoxiang Guo, Aldeida Aleti, Neelofar Neelofar, Chakkrit Tantithamthavorn, Yuanyuan Qi, and Tsong Yueh Chen

TL;DR
MORTAR is an automated multi-turn metamorphic testing approach for LLM-based dialogue systems. It mitigates the multi-turn oracle problem and reveals more diverse, higher-quality bugs than single-turn methods.
Contribution
It formalizes multi-turn dialogue testing, automates the generation of question-answer dialogue test cases, and uses metamorphic relations to detect bugs without relying on LLM judges.
Findings
Revealed over 150% more bugs than the single-turn baseline.
Detected higher-quality bugs, as measured by diversity, precision, and uniqueness.
Proved effective in testing six popular LLM-based dialogue systems.
Abstract
With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn testing scenarios. However, multi-turn interaction is the common real-world usage of dialogue systems, yet testing methods for such interactions remain underexplored. This is largely due to the oracle problem in multi-turn testing, which continues to pose a significant challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a metamorphic multi-turn dialogue testing approach, which mitigates the test oracle problem in testing LLM-based dialogue systems. MORTAR formalises the multi-turn testing for dialogue systems, and automates the generation of question-answer dialogue test cases with multiple dialogue-level perturbations…
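The idea of a dialogue-level metamorphic relation can be illustrated with a minimal sketch. The abstract is truncated before it names MORTAR's actual perturbations, so the perturbation used here (inserting an unrelated distractor turn) and every name in the code (`run_dialogue`, `insert_distractor`, `check_relation`, `toy_system`) are hypothetical and chosen only for illustration. The relation checked is: adding an unrelated turn should not change the answers to the original questions. The check compares answers by exact match, so no LLM judge and no ground-truth answer is needed.

```python
from typing import Callable, List

# Assumption: a dialogue system maps (earlier questions, current question) to an answer.
DialogueSystem = Callable[[List[str], str], str]


def run_dialogue(system: DialogueSystem, questions: List[str]) -> List[str]:
    """Ask the questions in order, passing the earlier questions as history."""
    answers: List[str] = []
    history: List[str] = []
    for question in questions:
        answers.append(system(history, question))
        history.append(question)
    return answers


def insert_distractor(questions: List[str], position: int, distractor: str) -> List[str]:
    """Hypothetical dialogue-level perturbation: add one unrelated turn."""
    return questions[:position] + [distractor] + questions[position:]


def check_relation(system: DialogueSystem, questions: List[str],
                   position: int, distractor: str) -> List[str]:
    """Return the original questions whose answer changed after the perturbation."""
    source_answers = run_dialogue(system, questions)
    followup_answers = run_dialogue(system, insert_distractor(questions, position, distractor))
    # Drop the distractor's own answer so both answer lists align turn by turn.
    aligned = followup_answers[:position] + followup_answers[position + 1:]
    return [q for q, a, b in zip(questions, source_answers, aligned) if a != b]


def toy_system(history: List[str], question: str) -> str:
    """Stand-in system with a seeded bug: it fails once two earlier turns exist."""
    facts = {"Where was Ada born?": "London", "When was Ada born?": "1815"}
    if len(history) >= 2:
        return "unknown"
    return facts.get(question, "unknown")


questions = ["Where was Ada born?", "When was Ada born?"]
violations = check_relation(toy_system, questions, 0, "What is your favourite colour?")
print(violations)
```

In this toy run the distractor pushes the second question one turn later, the seeded bug triggers, and the relation flags that question. A single-turn test that asks each question in isolation would not expose this kind of context-dependent failure.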
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Multi-Agent Systems and Negotiation
