RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

Jia Li, Hongyi Deng, Yiran Zhang, Kechi Zhang, Tianqi Shao, Tiankuo Zhao, Weinan Wang, Zhi Jin, Ge Li, Yang Liu, Yingtao Fang, and Yihong Dong

arXiv:2604.22659 · cs.SE · April 27, 2026

TL;DR

RealBench is a benchmark for repo-level code generation that gives LLMs structured system designs, such as UML diagrams, as input. This mirrors real-world software development practice more closely than natural-language-only prompts and allows LLMs to be evaluated in that setting.

Contribution

This paper presents a new benchmark aligned with industry practices, incorporating structured designs, and systematically evaluates LLMs' performance in repo-level code generation.

Findings

01

LLMs perform worse at repo-level code generation than at module-level tasks.

02

LLMs excel at creating modules from UML diagrams but often produce poor-quality code.

03

Generating an entire repository at once is effective for small projects, while module-by-module generation works better for complex repositories.
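The two generation strategies compared in finding 03 can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: the `Generator` type, the function names, the prompt layout, and the choice to pass earlier modules' output as context are all assumptions made for the example, and a stub stands in for the LLM.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns generated text.
Generator = Callable[[str], str]


def build_prompt(modules: Dict[str, str]) -> str:
    """Join every module design into one prompt (the layout is an assumption)."""
    return "\n\n".join(f"[{name}]\n{design}" for name, design in modules.items())


def generate_whole_repo(modules: Dict[str, str], generate: Generator) -> str:
    """Whole-repository strategy: a single call that sees every module design."""
    return generate(build_prompt(modules))


def generate_module_by_module(
    modules: Dict[str, str], generate: Generator
) -> Dict[str, str]:
    """Module-by-module strategy: one call per module.

    Passing previously generated code as context is an assumption of this
    sketch, not something the summary above specifies.
    """
    outputs: Dict[str, str] = {}
    for name, design in modules.items():
        context = "\n\n".join(outputs.values())
        prompt = f"{context}\n\n[{name}]\n{design}".strip()
        outputs[name] = generate(prompt)
    return outputs


# Stub generator that records each prompt so the two strategies can be compared.
calls: List[str] = []


def stub_generate(prompt: str) -> str:
    calls.append(prompt)
    return f"# code for prompt {len(calls)}"


designs = {"models": "class User", "service": "class UserService uses User"}
whole = generate_whole_repo(designs, stub_generate)
per_module = generate_module_by_module(designs, stub_generate)
```

With the stub, the whole-repository strategy issues one call and the module-by-module strategy issues one call per module, so prompt size grows with the repository in the first case and stays closer to a single module's design in the second.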

Abstract

Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks like HumanEval and EvoCodeBench have been created to evaluate LLMs by requiring them to generate code from natural language requirements. However, in enterprise applications and team development, developers typically write code based on structured designs or specifications rather than raw natural language descriptions. This gap between existing benchmarks and real industry development practices means that current benchmark scores may not accurately reflect how much code generation can help automate software development tasks. To address this gap, we propose RealBench, a repository-level code generation benchmark aligned with real-world industry software development…
