HarnessLLM: Automatic Testing Harness Generation via Reinforcement Learning
Yujian Liu, Jiabao Ji, Yang Zhang, Wenbo Guo, Tommi Jaakkola, Shiyu Chang

TL;DR
HarnessLLM introduces a reinforcement learning-based approach for automatic test harness generation, enabling more diverse and complex testing of programs with improved bug detection and validation capabilities.
Contribution
It presents a novel two-stage training pipeline combining supervised fine-tuning and reinforcement learning for generating test harnesses with complex validation.
Findings
Outperforms input-output-based testing in bug detection.
Increases testing strategy diversity.
Enhances code generation performance through test-time scaling.
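The test-time scaling finding can be illustrated with a best-of-n selection loop. This is a minimal sketch, not the paper's procedure: the summary does not say how candidates are ranked, so choosing the candidate that passes the most generated tests is an assumption, and every name here (`count_passed`, `select_best`, the toy absolute-value task) is hypothetical.

```python
from typing import Callable, List

Program = Callable[[int], int]
Test = Callable[[Program], bool]


def count_passed(candidate: Program, tests: List[Test]) -> int:
    """Count how many tests the candidate passes; a crash counts as a failure."""
    passed = 0
    for test in tests:
        try:
            if test(candidate):
                passed += 1
        except Exception:
            pass
    return passed


def select_best(candidates: List[Program], tests: List[Test]) -> Program:
    """Assumed selection rule: keep the candidate with the most passed tests."""
    return max(candidates, key=lambda c: count_passed(c, tests))


# Toy task (hypothetical): return the absolute value of an integer.
def candidate_identity(x: int) -> int:
    return x          # wrong for negative inputs


def candidate_negate(x: int) -> int:
    return -x         # wrong for positive inputs


def candidate_abs(x: int) -> int:
    return abs(x)     # correct


candidates = [candidate_identity, candidate_negate, candidate_abs]

# Stand-ins for generated tests; each checks one input.
tests = [
    lambda p: p(3) == 3,
    lambda p: p(-3) == 3,
    lambda p: p(0) == 0,
]

best = select_best(candidates, tests)
```

Here `candidate_abs` passes all three tests while the other two pass only two each, so it is the one selected.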
Abstract
Existing LLM-based automatic test generation methods mainly produce input and expected output pairs to categorize the intended behavior of correct programs. Although straightforward, these methods have limited diversity in generated tests and cannot provide enough debugging information. We propose HarnessLLM, a two-stage training pipeline that enables LLMs to write harness code for testing. Particularly, LLMs generate code that synthesizes inputs and validates the observed outputs, allowing complex test cases and flexible output validation such as invariant checking. To achieve this, we train LLMs with SFT followed by RLVR with a customized reward design. Experiments show that HarnessLLM outperforms input-output-based testing in bug finding and testing strategy diversity. HarnessLLM further benefits the code generation performance through test-time scaling with our generated test cases…
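To make the idea of harness code concrete, here is a minimal sketch of a harness that synthesizes inputs and validates outputs through invariant checks rather than hard-coded expected outputs. It is an illustration under assumptions, not the paper's generated code: the sorting task, the function names (`generate_inputs`, `check_output`, `run_harness`), and the buggy program are all hypothetical.

```python
import random
from collections import Counter
from typing import Callable, List, Optional


def generate_inputs(num_cases: int, seed: int = 0) -> List[List[int]]:
    """Synthesize inputs: a few hand-picked edge cases plus random lists."""
    rng = random.Random(seed)
    cases: List[List[int]] = [[], [5], [2, 2, 2]]
    for _ in range(num_cases):
        length = rng.randint(0, 20)
        cases.append([rng.randint(-5, 5) for _ in range(length)])
    return cases


def check_output(inp: List[int], out: List[int]) -> Optional[str]:
    """Validate invariants of a sorting program; return an error message or None."""
    if any(a > b for a, b in zip(out, out[1:])):
        return "output is not in non-decreasing order"
    if Counter(out) != Counter(inp):
        return "output is not a permutation of the input"
    return None


def run_harness(program: Callable[[List[int]], List[int]],
                num_cases: int = 50) -> Optional[str]:
    """Run the program on every input; report the first violated invariant."""
    for inp in generate_inputs(num_cases):
        out = program(list(inp))  # pass a copy so the program cannot mutate inp
        error = check_output(inp, out)
        if error is not None:
            return f"input={inp!r}: {error}"
    return None


def correct_sort(xs: List[int]) -> List[int]:
    return sorted(xs)


def buggy_sort(xs: List[int]) -> List[int]:
    return sorted(set(xs))  # bug: silently drops duplicates


print(run_harness(correct_sort))  # None
print(run_harness(buggy_sort))    # input=[2, 2, 2]: output is not a permutation of the input
```

Because the check states properties every correct output must satisfy, the harness needs no precomputed expected outputs, and the failure message names both the offending input and the violated property, which is the kind of debugging information a bare input-output pair does not give.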
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper works on test case generation, a very important issue for automatic code generation. The proposed approach that uses an automatic validator to validate test outputs is interesting. The trained model is demonstrated to work well for the benchmarks selected.
1. Potential bias in invariant learning and distribution analysis. The paper assumes that invariants can be effectively extracted or generated from real code, which might be true for competitive programs. But in practice, real-world programs often lack explicitly stated or easily derivable invariants. As a result, the reported distributional analysis of invariants could be biased toward competitive code or simplified code snippets, limiting the generalizability of the findings to realistic softw
1. The proposed framework advances beyond the conventional comparison between the expected output and the target program’s output. By incorporating invariant checking and brute-force reference implementations, it enables more fine-grained and reliable validation of program behavior.
2. The proposed framework demonstrates that a smaller model such as Qwen3-4B, when fine-tuned on outputs from the teacher model and further optimized with the proposed verifiable-reward-based RL, can even surpass th
1. The experimental comparison remains narrow. Most results are derived from Qwen3-4B as the student model and Qwen3-32B as the sole teacher model, with only a supplementary experiment on LLaMA3.2-3B presented in the appendix. This limited setup makes it unclear whether the proposed approach generalizes to different model scales, architectures, or teacher–student combinations.
2. The experiment compares the proposed method only against UTGen and simple input–output testing. It would be helpful
The paper targets an important problem (test generation)
The paper frames the test generation problem in a simplified and less practical scenario. In particular, the test generation scenario has to assume there is a problem statement (which serves as the specification for the correct code). However, this assumption does not always hold: in most cases, real-world software systems have no problem statement or natural-language specification. Therefore, the application scenario of the proposed approach is narrow and unrealistic.
1. The writing is good and easy to follow.
2. The paper presents a clear motivation.
1. The proposed method appears heavily dependent on the base model's inherent capabilities, and given that relatively small models (e.g., Qwen3-4B) were selected for experimentation, it remains unclear whether the approach would yield comparable performance improvements when applied to larger, more capable models.
2. The experimental evaluation is insufficiently comprehensive: the comparison with larger models is limited, the evaluation datasets contain relatively small sample sizes that raise que
Taxonomy
Topics: Software Testing and Debugging Techniques · VLSI and Analog Circuit Testing · Machine Learning and Algorithms
