DebugBench: Evaluating Debugging Capability of Large Language Models

Runchu Tian; Yining Ye; Yujia Qin; Xin Cong; Yankai Lin; Yinxu Pan; Yesai Wu; Haotian Hui; Weichuan Liu; Zhiyuan Liu; Maosong Sun

arXiv:2401.04621·cs.SE·June 7, 2024·2 cites

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper introduces DebugBench, a benchmark of 4,253 instances for evaluating the debugging capabilities of large language models across C++, Java, and Python, covering four major bug categories and 18 minor types. The evaluation reveals gaps between model and human debugging performance.

Contribution

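The paper presents DebugBench, a large-scale, multi-language debugging benchmark built from LeetCode community code with GPT-4-implanted bugs and rigorous quality checks. It addresses the limits of earlier evaluations of LLM debugging ability: risk of data leakage, small dataset scale, and narrow variety of tested bugs.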

Findings

01 Open-source models perform worse than humans at debugging.

02 Debugging difficulty varies significantly by bug category.

03 Runtime feedback affects debugging performance, though not uniformly.

Abstract

Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs' debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce 'DebugBench', an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and apply rigorous quality checks. We evaluate two commercial and four open-source models in a zero-shot scenario. We find that (1) while closed-source models exhibit inferior debugging performance compared to…
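
Because the benchmark spans several bug categories and three languages, results are naturally summarized per category. The following is a minimal sketch of such a tally, not the paper's evaluation code: the function name `pass_rate_by_category`, the `(category, language, passed)` record layout, the category labels, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """Return the fraction of passed instances for each bug category.

    `results` is an iterable of (category, language, passed) tuples.
    This record layout is hypothetical, chosen only for this sketch.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, _language, passed in results:
        totals[category] += 1
        if passed:
            passes[category] += 1
    return {category: passes[category] / totals[category] for category in totals}


# Made-up records; the labels "syntax" and "logic" are illustrative only.
sample = [
    ("syntax", "python", True),
    ("syntax", "java", True),
    ("logic", "cpp", False),
    ("logic", "python", True),
]
rates = pass_rate_by_category(sample)
```

Grouping by category rather than reporting a single overall number keeps differences between bug types visible; the same function could group by language by swapping which tuple field is used as the key.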

Code & Models

Repositories

thunlp/debugbench
Official

Datasets

Rtian/DebugBench
dataset · 623 downloads

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Adam · Layer Normalization · Residual Connection · Absolute Position Encodings · Dropout · Dense Connections