SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
Daoguang Zan, Zhirong Huang, Ailun Yu, Shaoxin Lin, Yifan Shi, Wei Liu, Dong Chen, Zongshuai Qi, Hao Yu, Lei Yu, Dezhi Ran, Muhan Zeng, Bo Shen, Pan Bian, Guangtai Liang, Bei Guan, Pengjie Huang, Tao Xie, Yongji Wang, Qianxiang Wang

TL;DR
This paper introduces SWE-bench-java, a new benchmark dataset for evaluating large language models' ability to resolve GitHub issues in Java, expanding the existing Python-focused benchmark to support multilingual programming tasks.
Contribution
The paper presents the first Java version of SWE-bench, including dataset, evaluation environment, and leaderboard, enabling multilingual assessment of issue resolving capabilities.
Findings
Implemented SWE-agent as a baseline method
Tested several powerful LLMs on SWE-bench-java
Dataset and tools are publicly available for community use
Abstract
GitHub issue resolving is a critical task in software engineering that has recently gained significant attention in both industry and academia. For this task, SWE-bench was released to evaluate the issue-resolving capabilities of large language models (LLMs), but it has so far focused only on Python. Supporting more programming languages is also important, as there is strong demand in industry. As a first step toward multilingual support, we have developed a Java version of SWE-bench, called SWE-bench-java. We have publicly released the dataset, along with the corresponding Docker-based evaluation environment and leaderboard, which will be continuously maintained and updated in the coming months. To verify the reliability of SWE-bench-java, we implement a classic method, SWE-agent, and test several powerful LLMs on it. As is well known, developing a high-quality multi-lingual…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Software System Performance and Reliability
Methods: Softmax · Attention Is All You Need
