CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

Forough Mehralian; Ryan Shar; James R. Rae; Alireza Hashemi

arXiv:2510.27565·cs.SE·November 3, 2025

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

Forough Mehralian, Ryan Shar, James R. Rae, Alireza Hashemi

PDF

Open Access 3 Reviews

TL;DR

This paper introduces CodeAlignBench, a comprehensive multi-language benchmark for evaluating code generation models on developer-aligned code adjustments, emphasizing instruction-following and refinement capabilities beyond functional correctness.

Contribution

We present a new extensible benchmark that assesses instruction-following and refinement in code generation across multiple languages, addressing limitations of existing evaluation methods.

Findings

01

Models show varied performance across languages and tasks.

02

Benchmark reveals strengths and limitations of current code models.

03

Automated evaluation pipeline enables comprehensive analysis.

Abstract

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of real-world coding tasks and developer expectations. To this end, we introduce a multi-language benchmark that evaluates LLM instruction-following capabilities and is extensible to operate on any set of standalone coding problems. Our benchmark evaluates instruction following in two key settings: adherence to pre-defined constraints specified with the initial problem, and the ability to perform refinements based on follow-up instructions. For this paper's analysis, we empirically evaluated our benchmarking pipeline with programming tasks from LiveBench, that are also automatically translated from Python into Java and JavaScript. Our automated benchmark…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01Rating 4Confidence 2

Strengths

- The process to collect the instruction catalog involves human developers and not just synthetic generation from LLMs, making it a valuable dataset and benchmark - The motivation of the paper of going beyond functional correctness is well-motivated, as code generation for users is not just about correctness but also about adhering to users' preferences - Evaluates multiple programming languages (Python, Java, JavaScript), allowing for cross-language comparison

Weaknesses

- Since some instructions can involve non-trivial adjustments to the code on a semantic level, it is not clear how accurate LLM-as-judge is for those kinds of instructions - Even though the benchmark creation process involves developers, there is still use of LLMs; it would be better if the benchmark could be vetted by humans to ensure validity - The models used in the experiments are all closed models and no open models are evaluated

Reviewer 02Rating 4Confidence 3

Strengths

- This paper explores an important and interesting research direction.

Weaknesses

- The evaluation lacks detailed analysis and concrete examples about LLMs’ performance on this benchmark. While there is some high-level discussion in Section 4, more in-depth study is required for a better understanding of the instruction-following capabilities of LLMs. - The evaluation of instruction-following capability is largely dependent on LLM-as-a-Judge, which is however not accurate enough. As shown in Section 4, the accuracy of LLM-as-a-Judge is only 86.67%, which limits its practical

Reviewer 03Rating 2Confidence 5

Strengths

- The paper introduces and evaluates two complementary instruction-following settings, predefined and follow-up, that have not been explored before. - The authors source instruction categories from realistic grounding, thereby reducing synthetic bias, and construction of an instruction catalog via human–LLM collaborative coding. - The paper extends LiveBench to Java and JavaScript, enhancing its cross-lingual relevance and applicability. - Quantitative results across three languages and 10 mode

Weaknesses

- The manuscript contains numerous minor typographical and style inconsistencies that detract from polish and readability. Examples include: - Page 1, line 53: the em dash around “CodeAlignBench —a benchmark” has asymmetric spacing (space before the dash, none after). - Page 3, line 140: "Javascript" should be "JavaScript" for consistency with other instances. - Page 5, line 257: the sentence starts with a lowercase letter; initial capitalization is expected. - Algorithm questions from Liv

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSoftware Engineering Research · Software Testing and Debugging Techniques · Model-Driven Software Engineering Techniques