SWE-Tester: Training Open-Source LLMs for Issue Reproduction in Real-World Repositories

Aditya Bharat Soni; Rajat Ghosh; Vaishnavi Bhargava; Valerie Chen; Debojyoti Dutta

arXiv:2601.13713·cs.SE·January 21, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces SWE-Tester, a pipeline for training open-source LLMs to generate issue reproduction tests, improving software testing and automated issue resolution by leveraging a large curated dataset.

Contribution

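We develop a novel pipeline for training open-source LLMs to generate issue reproduction tests, built on a curated dataset of 41K instances from 2.6K open-source GitHub repositories, and demonstrate significant performance improvements in issue reproduction test generation.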

Findings

1. Up to 10% success rate improvement on SWT-Bench Verified.
2. Up to 21% increase in change coverage.
3. Consistent gains with larger models and more compute.

Abstract

Software testing is crucial for ensuring the correctness and reliability of software systems. Automated generation of issue reproduction tests from natural language issue descriptions enhances developer productivity by simplifying root cause analysis, promotes test-driven development -- "test first, write code later", and can be used for improving the effectiveness of automated issue resolution systems like coding agents. Existing methods proposed for this task predominantly rely on closed-source LLMs, with limited exploration of open models. To address this, we propose SWE-Tester -- a novel pipeline for training open-source LLMs to generate issue reproduction tests. First, we curate a high-quality training dataset of 41K instances from 2.6K open-source GitHub repositories and use it to train LLMs of varying sizes and families. The fine-tuned models achieve absolute improvements of up…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Significant Performance Gains: The study reports substantial performance improvements, demonstrating the significant potential of open-source LLMs to effectively address real-world software engineering benchmarks.
2. Solid & Reproducible Foundation: The research is grounded in a well-curated and reproducible dataset of 41,000 issue-test pairs, establishing a solid foundation for future studies in open-source software engineering.
3. Transparent & Simple Workflow: The paper introduces a stra

Weaknesses

* Superficial Performance Gains: The reported improvements are primarily driven by a brute-force approach of sampling and reranking multiple patches, rather than by genuine advancements in the model's reasoning capabilities. This reliance on test-time scaling may inflate benchmark scores but fails to address the core challenge of autonomous issue comprehension and causal reasoning in code.
* Limited and Inflexible Architecture: The framework is fundamentally a static, two-step pipeline, devoid

Reviewer 02 · Rating 4 · Confidence 5

Strengths

- The authors evaluate multiple open models of different sizes and families, analyze scaling effects in both training data and inference-time compute, and offer detailed quantitative insights.
- The dataset of 41K issue–test pairs is well-filtered and reproducible, providing a strong foundation for open-source SWE research.
- The workflow is simple and interpretable, with carefully described steps for localization, editing, and evaluation.
- The reported gains show that open-source LLMs can m

Weaknesses

- The proposed framework is purely a static two-step pipeline: there is no reasoning loop, reflection, or autonomous planning. As the community rapidly transitions toward agentic SWE systems, this direction feels inherently limited and non-scalable. It lacks the ability to generalize beyond the fixed workflow or adapt dynamically to complex issue contexts.
- The performance improvements are largely achieved through sampling multiple patches and reranking rather than stronger modeling or reasonin

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- Addresses an important problem in software engineering, **bug reproduction**, and improves LLM performance on this task through targeted training.
- Conducts training across multiple models and provides detailed analyses of experimental results.

Weaknesses

- The data construction method and reproduction pipeline are largely adapted from well-established approaches in the issue resolution literature; the work mainly applies these existing methods to the issue reproduction task, which limits its methodological novelty for a top-tier conference like ICLR.
- Focuses solely on the “edit exactly one test file” scenario, which may hurt generalizability.
- Lacks appropriate baselines. Although few prior works explicitly target issue reproduction, many **c

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software Engineering Techniques and Practices