Toward Functional and Non-Functional Evaluation of Application-Level Code Generation

Ruwei Pan; Yakun Zhang; Qingyuan Liang; Yueheng Zhu; Chao Liu; Lu Zhang; Hongyu Zhang

arXiv:2602.03462 · cs.SE · March 31, 2026


TL;DR

This paper introduces RAL-Bench, a comprehensive benchmark for evaluating large language models on application-level code generation, assessing both functional correctness and non-functional quality attributes in a unified framework.

Contribution

The paper presents RAL-Bench, an evaluation framework that measures LLM performance on end-to-end repository generation, covering multi-file structure, dependency support, and non-functional quality attributes.

Findings

01 No model exceeds a 45% functional correctness score.

02 Functional correctness is the main bottleneck in application-level code generation.

03 Non-functional quality attributes are also challenging for current models.

Abstract

Large language models (LLMs) have achieved strong performance on code generation. However, most prior evaluations focus on snippet-level outputs, such as function generation or repository completion. These settings do not fully evaluate application-level code generation, where the goal is to produce a runnable repository with coherent multi-file structure, dependency support, and end-to-end executability. In addition, real-world software quality depends not only on functional correctness but also on non-functional quality attributes, such as maintainability and security. In this paper, we present RAL-Bench, a benchmark and evaluation framework for application-level code generation. For each task, RAL-Bench derives a concise natural-language requirement from a high-quality reference project, constructs black-box system tests for both functional correctness and non-functional quality…
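To make the idea of black-box system testing concrete, here is a minimal Python sketch of how a generated repository might be exercised from the outside and scored. It is not RAL-Bench's actual harness: the `SystemTest` record, the stdout-comparison check, and the use of a plain pass rate as the functional score are all assumptions made for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class SystemTest:
    """One black-box test (hypothetical format): run a command, compare stdout."""
    name: str
    argv: list[str]
    expected_stdout: str


def run_system_test(test: SystemTest, repo_dir: str, timeout_s: float = 10.0) -> bool:
    """Run one test against the repository without inspecting its source code."""
    try:
        proc = subprocess.run(
            test.argv,
            cwd=repo_dir,          # execute inside the generated repository
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hang or an unlaunchable command counts as a failure.
        return False
    return proc.returncode == 0 and proc.stdout.strip() == test.expected_stdout.strip()


def pass_rate(tests: list[SystemTest], repo_dir: str) -> float:
    """Fraction of system tests that pass (assumed scoring rule); 0.0 if no tests."""
    if not tests:
        return 0.0
    passed = sum(run_system_test(t, repo_dir) for t in tests)
    return passed / len(tests)


if __name__ == "__main__":
    # Hypothetical usage: the repository path and entry point are placeholders.
    demo = [SystemTest("greets", [sys.executable, "app.py"], "hello")]
    print(f"pass rate: {pass_rate(demo, '.'):.2f}")
```

Because the tests only observe exit codes and output, the same suite can be run against the reference project and against any generated repository, regardless of how the files are organized internally.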
