Toward Functional and Non-Functional Evaluation of Application-Level Code Generation
Ruwei Pan, Yakun Zhang, Qingyuan Liang, Yueheng Zhu, Chao Liu, Lu Zhang, Hongyu Zhang

TL;DR
This paper introduces RAL-Bench, a comprehensive benchmark for evaluating large language models on application-level code generation, assessing both functional correctness and non-functional quality attributes in a unified framework.
Contribution
The paper presents RAL-Bench, a novel evaluation framework that measures LLM performance on end-to-end repository generation, covering multi-file structure, dependency support, and non-functional quality attributes.
Findings
No evaluated model exceeds a functional correctness score of 45%.
Functional correctness is the main bottleneck in application-level code generation.
Non-functional quality attributes are also challenging for current models.
Abstract
Large language models (LLMs) have achieved strong performance on code generation. However, most prior evaluations focus on snippet-level outputs, such as function generation or repository completion. These settings do not fully evaluate application-level code generation, where the goal is to produce a runnable repository with coherent multi-file structure, dependency support, and end-to-end executability. In addition, real-world software quality depends not only on functional correctness but also on non-functional quality attributes, such as maintainability and security. In this paper, we present RAL-Bench, a benchmark and evaluation framework for application-level code generation. For each task, RAL-Bench derives a concise natural-language requirement from a high-quality reference project, constructs black-box system tests for both functional correctness and non-functional quality…
