A Comprehensive Study on Large Language Models for Mutation Testing
Bo Wang, Mingda Chen, Ming Deng, Youfang Lin, Mark Harman, Mike Papadakis, Jie M. Zhang

TL;DR
This study empirically evaluates large language models for mutation testing, showing they generate more effective and diverse mutants than rule-based methods but with higher non-compilability and duplication rates.
Contribution
It provides the first comprehensive empirical comparison of LLMs for mutation testing, establishing a baseline for effectiveness and highlighting trade-offs in mutant quality.
Findings
LLMs produce more diverse mutants that are behaviorally closer to real bugs.
LLMs achieve 111.29% higher fault detection than rule-based methods (87.98% vs. 41.64%, a gap of 46.34 percentage points).
LLM-generated mutants have higher non-compilability and duplication rates than rule-based mutants.
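To make the trade-off in the findings concrete, the following is a minimal sketch of how candidate mutants might be triaged before use. It is not the paper's pipeline: the study targets Java, whereas this sketch uses Python, treats "compilable" as "parses without a syntax error", and treats a "duplicate" as a candidate whose syntax tree matches the original or an earlier candidate. The names `triage`, `kills`, and `max_of` are hypothetical and introduced only for illustration.

```python
import ast


def is_compilable(src: str) -> bool:
    """Assumption: a mutant counts as compilable if Python can compile it."""
    try:
        compile(src, "<mutant>", "exec")
        return True
    except SyntaxError:
        return False


def normalize(src: str) -> str:
    """Dump the syntax tree so comments and redundant parentheses are ignored."""
    return ast.dump(ast.parse(src))


def triage(original: str, candidates: list[str]):
    """Split candidates into usable mutants, non-compilable ones, and duplicates."""
    seen = {normalize(original)}
    usable, non_compilable, duplicates = [], 0, 0
    for candidate in candidates:
        if not is_compilable(candidate):
            non_compilable += 1
            continue
        key = normalize(candidate)
        if key in seen:
            duplicates += 1
            continue
        seen.add(key)
        usable.append(candidate)
    return usable, non_compilable, duplicates


def kills(src: str, tests) -> bool:
    """A test suite kills a mutant if any test observes a wrong result."""
    namespace = {}
    exec(src, namespace)
    return any(namespace["max_of"](*args) != expected for args, expected in tests)


original = "def max_of(a, b):\n    return a if a > b else b\n"
candidates = [
    "def max_of(a, b):\n    return a if a < b else b\n",        # real change
    "def max_of(a, b):\n    return a if a > b else b  # x\n",   # same as original
    "def max_of(a, b):\n    return a if a > b else\n",          # syntax error
    "def max_of(a, b):\n    return (a if a < b else b)\n",      # same as first
]
tests = [((1, 2), 2), ((3, 1), 3)]

usable, non_compilable, duplicates = triage(original, candidates)
killed = [kills(m, tests) for m in usable]
```

Under these assumptions, only one of the four candidates survives triage, which illustrates why non-compilability and duplication rates matter when comparing mutant generators: they reduce the number of mutants that can actually be run against a test suite.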
Abstract
Large Language Models (LLMs) have recently been used to generate mutants in both research work and industrial practice. However, there has been no comprehensive empirical study of their performance for this increasingly important LLM-based Software Engineering application. To address this, we conduct a comprehensive empirical study evaluating BugFarm and LLMorpheus (the two state-of-the-art LLM-based approaches), alongside seven LLMs using our newly designed prompt, including both leading open- and closed-source models, on 851 real bugs from two real-world Java bug benchmarks. Our results reveal that, compared to existing rule-based approaches, LLMs generate more diverse mutants that are behaviorally closer to real bugs and, most importantly, achieve 111.29% higher fault detection. That is, 87.98% (for LLMs) vs. 41.64% (for rule-based), an increase of 46.34 percentage points.…
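The abstract reports the same improvement in two ways: as a relative increase (111.29%) and as a gap in percentage points (46.34). A short sketch of the arithmetic, using only the two detection rates quoted above, shows how the figures relate. The function names are illustrative and not taken from the paper.

```python
def percentage_point_gap(new_rate: float, base_rate: float) -> float:
    """Absolute difference between two rates expressed in percent."""
    return new_rate - base_rate


def relative_increase_pct(new_rate: float, base_rate: float) -> float:
    """Increase relative to the baseline rate, expressed in percent."""
    return (new_rate - base_rate) / base_rate * 100.0


llm_rate = 87.98   # fault detection for LLM-generated mutants, in percent
rule_rate = 41.64  # fault detection for rule-based mutants, in percent

gap = round(percentage_point_gap(llm_rate, rule_rate), 2)        # 46.34
relative = round(relative_increase_pct(llm_rate, rule_rate), 2)  # 111.29
print(gap, relative)
```

The two numbers describe one result: the LLM rate is 46.34 points above the rule-based rate, which is slightly more than double the baseline.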
