UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench

Boxi Yu; Yuxuan Zhu; Pinjia He; Daniel Kang

arXiv:2506.09289 · cs.SE · June 12, 2025

TL;DR

This paper introduces UTBoost, a framework that augments the test suites of code generation benchmarks with LLM-generated test cases. The added tests expose patches that were wrongly marked as passing under the original tests, which makes model rankings more reliable.

Contribution

We present UTGenerator and UTBoost, novel tools for automatic test case augmentation that improve the evaluation of coding agents on real-world benchmarks.

Findings

01 Identified 36 tasks with insufficient test cases
02 Uncovered 345 erroneous patches passing original tests
03 Caused 18 and 11 ranking changes in benchmark leaderboards
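
To make the notion of a "ranking change" concrete, here is a minimal sketch of how a leaderboard could be re-ranked once erroneous patches are no longer counted as resolved. The agent names and numbers are hypothetical. The counting rule (an agent counts as changed if its position differs before and after correction) is an assumption, not necessarily the rule used in the paper.

```python
from typing import Dict, List


def ranking(resolved: Dict[str, int]) -> List[str]:
    # Sort by resolved count, descending; break ties by name for determinism.
    return sorted(resolved, key=lambda agent: (-resolved[agent], agent))


def count_ranking_changes(before: Dict[str, int], after: Dict[str, int]) -> int:
    # Assumed rule: count agents whose leaderboard position differs.
    rank_before = ranking(before)
    rank_after = ranking(after)
    return sum(
        1 for agent in rank_before
        if rank_before.index(agent) != rank_after.index(agent)
    )


# Hypothetical leaderboard: tasks marked resolved under the original tests.
before = {"agent_a": 120, "agent_b": 118, "agent_c": 110}
# Hypothetical number of erroneous patches per agent caught by augmented tests.
erroneous = {"agent_a": 5, "agent_b": 1, "agent_c": 0}

after = {agent: before[agent] - erroneous[agent] for agent in before}
changes = count_ranking_changes(before, after)
```

In this toy case agent_a and agent_b swap places while agent_c stays third, so two positions change.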

Abstract

The advent of Large Language Models (LLMs) has spurred the development of coding agents for real-world code generation. As a widely used benchmark for evaluating the code generation capabilities of these agents, SWE-Bench uses real-world problems based on GitHub issues and their corresponding pull requests. However, the manually written test cases included in these pull requests are often insufficient, allowing generated patches to pass the tests without resolving the underlying issue. To address this challenge, we introduce UTGenerator, an LLM-driven test case generator that automatically analyzes codebases and dependencies to generate test cases for real-world Python projects. Building on UTGenerator, we propose UTBoost, a comprehensive framework for test case augmentation. In our evaluation, we identified 36 task instances with insufficient test cases and uncovered 345 erroneous…
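
The core problem described in the abstract, a patch passing the original tests without resolving the issue, can be illustrated with a small sketch. Everything here is hypothetical: the toy `strip_prefix` task, both patches, and the `relabel` helper are invented for illustration and are not taken from SWE-Bench or from the UTBoost code. The augmented test is written by hand, standing in for what an LLM-driven generator might produce.

```python
from typing import Callable, Dict, List

Patch = Callable[[str, str], str]
Test = Callable[[Patch], bool]


# Hypothetical task: remove `prefix` from `s` only when `s` starts with it.
def reference_patch(s: str, prefix: str) -> str:
    return s[len(prefix):] if s.startswith(prefix) else s


def erroneous_patch(s: str, prefix: str) -> str:
    # Bug: always drops len(prefix) characters, even if the prefix is absent.
    return s[len(prefix):]


# Original (insufficient) test: only covers the case where the prefix is present.
original_tests: List[Test] = [
    lambda f: f("foobar", "foo") == "bar",
]

# Augmented suite: adds the uncovered case where the prefix is absent.
augmented_tests: List[Test] = original_tests + [
    lambda f: f("barfoo", "foo") == "barfoo",
]


def passes(patch: Patch, tests: List[Test]) -> bool:
    return all(test(patch) for test in tests)


def relabel(patches: Dict[str, Patch]) -> Dict[str, str]:
    # Re-evaluate each patch: flag those that pass only the original tests.
    labels = {}
    for name, patch in patches.items():
        if not passes(patch, original_tests):
            labels[name] = "failed"
        elif passes(patch, augmented_tests):
            labels[name] = "passed"
        else:
            labels[name] = "erroneous"
    return labels


labels = relabel({"reference": reference_patch, "candidate": erroneous_patch})
```

The candidate patch passes the original test, so the original evaluation would count it as resolved. Only the added test reveals that it mishandles strings without the prefix.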

Code & Models

Repositories

cuhk-shenzhen-se/utboost (Official)

Videos

UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench · underline

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Software Testing and Debugging Techniques