A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models
Ruida Hu, Chao Peng, Jingyi Ren, Bo Jiang, Xiangxin Meng, Qinyun Wu, Pengfei Gao, Xinchen Wang, Cuiyun Gao

TL;DR
FAUN-Eval is a new benchmark designed to evaluate the fine-grained issue solving capabilities of large language models across tasks like QA, fault localization, and code editing, using a curated dataset from GitHub.
Contribution
The paper introduces FAUN-Eval, a comprehensive benchmark that assesses LLMs' performance on specific software engineering subtasks, addressing limitations of previous benchmarks.
Findings
Top-performing LLMs vary across tasks
Issue features can mislead LLMs
Model performance depends on text length
Abstract
Automatically resolving software issues is crucial for software development in practice, impacting the software quality and user experience. The process of resolving real-world issues encompasses tasks such as question-answering (QA), fault localization, and code editing. Existing benchmarks such as HumanEval fall short in their ability to assess LLMs' proficiency in solving issues within a codebase. Although benchmarks like SWE-Bench are designed to evaluate the LLMs' capability to handle real-world GitHub issues, the end-to-end evaluation method cannot provide granular insights on the performance of subtasks involved in issue solving. To address existing deficiencies in benchmarking LLMs for practical software engineering tasks, we introduce FAUN-Eval, a benchmark specifically designed to evaluate the Fine-grAined issUe solviNg capabilities of LLMs. FAUN-Eval systematically assesses…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
