Code Aesthetics with Agentic Reward Feedback
Bang Xiao, Lingjie Jiang, Shaohan Huang, Tengchao Lv, Yupan Huang, Xun Wu, Lei Cui, Furu Wei

TL;DR
This paper introduces a new approach to improve the aesthetic quality of code generated by large language models through a specialized dataset, multi-agent aesthetic evaluation, and joint optimization, resulting in significantly better code aesthetics.
Contribution
We propose the AesCode-358K dataset, an agentic reward feedback system, and the GRPO-AR algorithm to optimize code aesthetics alongside functionality in LLMs.
Findings
AesCoder-4B outperforms GPT-4o and GPT-4.1 in code aesthetics.
Combining supervised fine-tuning with reinforcement learning improves aesthetic quality.
Our approach achieves performance comparable to large open-source models with 480B-685B parameters.
Abstract
Large Language Models (LLMs) have become valuable assistants for developers in code-related tasks. While LLMs excel at traditional programming tasks such as code generation and bug fixing, they struggle with visually-oriented coding tasks, often producing suboptimal aesthetics. In this paper, we introduce a new pipeline to enhance the aesthetic quality of LLM-generated code. We first construct AesCode-358K, a large-scale instruction-tuning dataset focused on code aesthetics. Next, we propose agentic reward feedback, a multi-agent system that evaluates executability, static aesthetics, and interactive aesthetics. Building on this, we develop GRPO-AR, which integrates these signals into the GRPO algorithm for joint optimization of functionality and code aesthetics. Finally, we develop OpenDesign, a benchmark for assessing code aesthetics. Experimental results show that combining…
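As a rough illustration of how the three agent signals could feed a GRPO-style update, here is a minimal Python sketch. It is not the paper's implementation: the weights, the assumption that each score lies in [0, 1], and all names (`AgentScores`, `combined_reward`, `group_relative_advantages`) are hypothetical. The only parts taken from the text are the three signals and the idea of combining them into one reward inside GRPO, which standardizes rewards within a group of samples drawn for the same prompt.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class AgentScores:
    """Scores from the three agents; the [0, 1] range is an assumption."""
    executability: float
    static_aesthetics: float
    interactive_aesthetics: float


# Hypothetical weights; the values used in the paper are not given here.
WEIGHTS = (0.2, 0.4, 0.4)


def combined_reward(scores: AgentScores, weights=WEIGHTS) -> float:
    """Weighted sum of the three agent scores."""
    w_exec, w_static, w_inter = weights
    return (
        w_exec * scores.executability
        + w_static * scores.static_aesthetics
        + w_inter * scores.interactive_aesthetics
    )


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's group.

    Uses the population standard deviation; eps guards against a group in
    which every sample received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled completions for one prompt (made-up scores).
group = [
    AgentScores(1.0, 0.8, 0.6),
    AgentScores(1.0, 0.5, 0.5),
    AgentScores(0.0, 0.0, 0.0),  # e.g. a page that failed to render
]
rewards = [combined_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
```

In this sketch a completion is pushed up or down only relative to the other samples for the same prompt, so a sample that renders and scores well on both aesthetic signals receives the largest advantage in its group.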
Peer Reviews
Decision·ICLR 2026 Poster
The problem is well defined, the methodology is mostly sound, and the experiments support the main claims. The reliance on proprietary judges (GPT-5/GPT-4o) for both data curation and evaluation introduces possible bias, which the paper partially mitigates via correlation with Design Arena and human annotations. More analysis on sensitivity to judge choice and reward weights would strengthen soundness. The execution agent uses HTMLHint rules rather than brittle strict parsing; the static aesthe
GPT-5 (and GPT-4o) are used during dataset filtering, for static aesthetics scoring, and within the interactive agent. This may bias the training signal and the benchmark toward these judges’ preferences. While OpenDesign shows strong rank correlation with Design Arena and decent GPT–human agreement, results could be sensitive to the choice of judge. Please report results with at least one strong alternative judge and quantify changes. The paper defines the weighted sum for r and also reports a
- The paper addresses an underexplored but important problem: the aesthetic quality of code generated by LLMs for visual tasks, for which measuring functional correctness is insufficient.
- The AesCode-358K dataset is a valuable contribution that can benefit future research in aesthetic-aware code generation once released.
- The proposed OpenDesign benchmark is also a useful dataset that addresses the limitation of having human voters for evaluation, providing a reproducible way to evaluate
- The same aesthetic agents are used both during reinforcement learning and evaluation, introducing potential circularity and reward overfitting. The model may learn to exploit or mimic the judge model’s biases rather than improving true generalization in visual aesthetics.
- While improvements on visual coding tasks are reported, the paper lacks an evaluation of whether the aesthetic alignment impacts general code generation ability (e.g., possible regression on standard code benchmarks such a
**Originality**: This work represents the first systematic effort to formalize and computationally address the concept of "code aesthetics," a dimension of code quality long acknowledged by practitioners but largely neglected in automated code generation research. The proposed multi-agent reward framework is a novel architecture that synergistically combines textual analysis, visual rendering, and interactive evaluation to assess code quality holistically. A particularly creative contribution i
**Reproducibility Concerns** The reproducibility of this study is significantly hampered by its deep dependence on several proprietary, black-box models, specifically GPT-4 and GPT-4V, which serve as the core "judges" in the reward model and are pivotal for the initial dataset generation. This reliance creates a hard barrier for independent verification, as the internal mechanics and future versions of these models are opaque and subject to change, making it impossible to exactly replicate the
