A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models

Ruida Hu; Chao Peng; Jingyi Ren; Bo Jiang; Xiangxin Meng; Qinyun Wu; Pengfei Gao; Xinchen Wang; Cuiyun Gao

arXiv:2411.18019·cs.SE·November 28, 2024

Open Access

TL;DR

FAUN-Eval is a new benchmark designed to evaluate the fine-grained issue solving capabilities of large language models across tasks like QA, fault localization, and code editing, using a curated dataset from GitHub.

Contribution

The paper introduces FAUN-Eval, a comprehensive benchmark that assesses LLMs' performance on specific software engineering subtasks, addressing limitations of previous benchmarks.

Findings

01 Top-performing LLMs vary across tasks

02 Issue features can mislead LLMs

03 Model performance depends on text length

Abstract

Automatically resolving software issues is crucial for software development in practice, impacting the software quality and user experience. The process of resolving real-world issues encompasses tasks such as question-answering (QA), fault localization, and code editing. Existing benchmarks such as HumanEval fall short in their ability to assess LLMs' proficiency in solving issues within a codebase. Although benchmarks like SWE-Bench are designed to evaluate the LLMs' capability to handle real-world GitHub issues, the end-to-end evaluation method cannot provide granular insights on the performance of subtasks involved in issue solving. To address existing deficiencies in benchmarking LLMs for practical software engineering tasks, we introduce FAUN-Eval, a benchmark specifically designed to evaluate the Fine-grAined issUe solviNg capabilities of LLMs. FAUN-Eval systematically assesses…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling