MRG-Bench: Evaluating and Exploring the Requirements of Context for Repository-Level Code Generation

Haiyang Li

arXiv:2508.02998·cs.SE·August 6, 2025

TL;DR

MRG-Bench is a multi-language benchmark for repository-level code generation. It addresses limitations of earlier datasets by drawing on real-world repositories, covering Python, Java, and Go, and providing runnable test cases. Evaluation on it shows that current models perform poorly and struggle to understand user requirements.

Contribution

Introduces MRG-Bench, a multi-language dataset for repository-level code generation built from real-world repository data with project-level runnable test cases, enabling more accurate evaluation of LLMs.

Findings

01

Current models show significant performance deficiencies.

02

Models struggle with understanding user requirements.

03

Language-specific contextual information impacts model performance.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in code generation. However, current evaluation datasets suffer from several issues: a lack of runnable test cases, deviation from the distribution of real-world code, and coverage limited to Python. These limitations undermine the credibility of evaluation results. To address them, we introduce MRG-Bench (Multi-language Repository-level Code Generation Benchmark), a novel dataset that provides a more accurate evaluation of LLMs on practical repository-level code generation tasks. MRG-Bench has three main features: (1) practical data sourced from real-world code repositories that aligns with the practical distribution, (2) support for multiple programming languages, including Python, Java, and Go, and (3) project-level runnable test cases to assess the quality of the…
