MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios

Jinyang Huang; Xiachong Feng; Qiguang Chen; Hanjie Zhao; Zihui Cheng; Jiesong Bai; Jingxuan Zhou; Min Li; Libo Qin

arXiv:2506.13824 · cs.SE · June 18, 2025

TL;DR

This paper introduces MLDebugging, a comprehensive benchmark for evaluating code debugging in complex multi-library Python scenarios, revealing current LLMs' limitations in such settings.

Contribution

It presents the first benchmark specifically designed for multi-library debugging, covering 126 libraries and seven issue types, and evaluates LLM performance in this challenging context.

Findings

01 Current LLMs struggle with multi-library debugging tasks.

02 MLDebugging reveals significant gaps in LLM capabilities for complex code scenarios.

03 The benchmark provides a new resource for future research in multi-library debugging.

Abstract

Code debugging is a crucial task in software engineering that has attracted increasing attention. While remarkable progress has been made in the era of large language models (LLMs), current research still focuses on the simple no-library or single-library setting, ignoring the complex multi-library scenarios found in real-world applications. To address this limitation, we make the first attempt to introduce MLDebugging (Multi-Library Debugging), a comprehensive benchmark designed to assess debugging challenges within multi-library Python code. Specifically, MLDebugging encompasses 126 distinct Python libraries and covers a wide range of multi-library code issues, categorized into seven distinct types. Furthermore, we conduct a thorough evaluation of MLDebugging using both mainstream open-source and closed-source LLMs and highlight that current LLMs still struggle to correctly perform code debugging…
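To make the multi-library setting concrete, the sketch below shows a bug that only appears where two libraries meet. It is a hypothetical illustration, not a sample from MLDebugging: the function names, the CSV data, and the choice of the standard-library `csv` and `statistics` modules are all assumptions made for this example.

```python
import csv
import io
import statistics

# Hypothetical input data, invented for this illustration.
CSV_TEXT = "name,score\nalice,90\nbob,70\n"


def mean_score_buggy(text):
    """Buggy version: passes csv output straight into statistics."""
    rows = csv.DictReader(io.StringIO(text))
    # csv.DictReader yields every field as a str, so statistics.mean
    # receives strings and raises TypeError.
    return statistics.mean(row["score"] for row in rows)


def mean_score_fixed(text):
    """Fixed version: converts each field to float at the library boundary."""
    rows = csv.DictReader(io.StringIO(text))
    return statistics.mean(float(row["score"]) for row in rows)


if __name__ == "__main__":
    try:
        mean_score_buggy(CSV_TEXT)
    except TypeError as exc:
        print("buggy version failed:", exc)
    print("fixed version:", mean_score_fixed(CSV_TEXT))
```

Each call is valid on its own; the failure comes from a mismatch between what one library returns and what the other expects, which is the kind of interaction a no-library or single-library setting would not exercise.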

Code & Models

Repositories

hjytsuki/mldebugging (Official)

Taxonomy

Topics: Scientific Computing and Data Management · Parallel Computing and Optimization Techniques · Software Testing and Debugging Techniques