UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench
Boxi Yu, Yuxuan Zhu, Pinjia He, Daniel Kang

TL;DR
This paper introduces UTBoost, a framework that strengthens code evaluation benchmarks by using LLMs to generate additional test cases automatically. The added tests expose patches that existing assessments wrongly count as correct, which makes model rankings more reliable.
Contribution
We present UTGenerator and UTBoost, novel tools for automatic test case augmentation that improve the evaluation of coding agents on real-world benchmarks.
Findings
Identified 36 tasks with insufficient test cases
Uncovered 345 erroneous patches passing original tests
Caused 18 and 11 ranking changes on the SWE-Bench Lite and SWE-Bench Verified leaderboards, respectively
Abstract
The advent of Large Language Models (LLMs) has spurred the development of coding agents for real-world code generation. As a widely used benchmark for evaluating the code generation capabilities of these agents, SWE-Bench uses real-world problems based on GitHub issues and their corresponding pull requests. However, the manually written test cases included in these pull requests are often insufficient, allowing generated patches to pass the tests without resolving the underlying issue. To address this challenge, we introduce UTGenerator, an LLM-driven test case generator that automatically analyzes codebases and dependencies to generate test cases for real-world Python projects. Building on UTGenerator, we propose UTBoost, a comprehensive framework for test case augmentation. In our evaluation, we identified 36 task instances with insufficient test cases and uncovered 345 erroneous…
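The re-evaluation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `PatchResult` record, the helper functions, and the agent and task names are all hypothetical, and the sample data is made up purely to show the mechanics. The sketch assumes each patch has already been run against both the original tests and the augmented tests, and it shows how a patch that passes the original tests but fails the augmented ones gets relabeled, which can in turn reorder a leaderboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchResult:
    """Outcome of one agent's patch on one task (hypothetical record type)."""
    agent: str
    task_id: str
    passes_original: bool   # passed the tests shipped with the pull request
    passes_augmented: bool  # passed the additional generated tests


def find_erroneous_patches(results):
    """Patches accepted by the original tests but rejected by the augmented ones."""
    return [r for r in results if r.passes_original and not r.passes_augmented]


def resolved_counts(results, use_augmented):
    """Count resolved tasks per agent, optionally requiring the augmented tests too."""
    counts = {}
    for r in results:
        ok = r.passes_original and (r.passes_augmented or not use_augmented)
        counts[r.agent] = counts.get(r.agent, 0) + int(ok)
    return counts


def ranking(counts):
    """Agents ordered by resolved count (descending), ties broken by name."""
    return sorted(counts, key=lambda agent: (-counts[agent], agent))


# Made-up illustration data, not results from the paper.
results = [
    PatchResult("agent_a", "task-1", True, False),
    PatchResult("agent_a", "task-2", True, False),
    PatchResult("agent_a", "task-3", True, True),
    PatchResult("agent_b", "task-1", True, True),
    PatchResult("agent_b", "task-2", True, True),
    PatchResult("agent_b", "task-3", False, False),
]

erroneous = find_erroneous_patches(results)
before = ranking(resolved_counts(results, use_augmented=False))
after = ranking(resolved_counts(results, use_augmented=True))
print(len(erroneous), before, after)
```

In this toy data, `agent_a` leads under the original tests alone, but two of its patches fail the augmented tests, so `agent_b` moves ahead once the stricter check is applied.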
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Software Testing and Debugging Techniques
