IFEvalCode: Controlled Code Generation

Jian Yang; Wei Zhang; Shukai Liu; Linzheng Chai; Yingshui Tan; Jiaheng Liu; Ge Zhang; Wangchunshu Zhou; Guanglin Niu; Zhoujun Li; Binyuan Hui; Junyang Lin

arXiv:2507.22462·cs.CL·August 4, 2025

TL;DR

This paper introduces a new benchmark, IFEvalCode, and methods for controlled code generation, emphasizing adherence to detailed instructions and evaluating models' instruction-following capabilities across multiple programming languages.

Contribution

The paper proposes forward and backward constraint generation to improve instruction following in Code LLMs, and introduces IFEvalCode, a multilingual benchmark that reports correctness and instruction following as separate metrics.
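To make the idea of separate metrics concrete, here is a minimal sketch of scoring one generated solution on two independent axes. It is an illustration only, not the benchmark's actual harness: the `Sample` layout, the function names, and the choice of a line-count limit as the constraint are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical record layout, not the benchmark's real schema.
    code: str          # model-generated solution
    entry_point: str   # name of the function the tests call
    max_lines: int     # assumed instruction: "use at most N non-empty lines"


def check_correctness(sample, tests):
    """Run the generated code and compare outputs with expected values."""
    namespace = {}
    try:
        exec(sample.code, namespace)  # only safe for trusted, toy inputs
        func = namespace[sample.entry_point]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False


def check_instruction(sample):
    """Check the line-count constraint without looking at behavior."""
    non_empty = [line for line in sample.code.splitlines() if line.strip()]
    return len(non_empty) <= sample.max_lines


def score(items):
    """Return the two rates separately instead of merging them."""
    n = len(items)
    correct = sum(check_correctness(s, t) for s, t in items)
    followed = sum(check_instruction(s) for s, _ in items)
    return {"correctness": correct / n, "instruction_following": followed / n}


tests = [((1, 2), 3)]
verbose = Sample("def add(a, b):\n    return a + b\n", "add", max_lines=1)
terse = Sample("add = lambda a, b: a - b", "add", max_lines=1)
print(score([(verbose, tests), (terse, tests)]))
```

In this toy run the first solution is correct but exceeds the line limit, while the second respects the limit but computes the wrong result. A single pass/fail score would hide that difference, whereas the two rates expose it.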

Findings

01

Closed-source models outperform open-source models in controlled code generation.

02

Models show a significant gap between code correctness and instruction adherence.

03

IFEvalCode enables nuanced evaluation of instruction-following in multilingual code generation.

Abstract

Code large language models (Code LLMs) have made significant progress in code generation by translating natural language descriptions into functional code; however, real-world applications often demand stricter adherence to detailed requirements such as coding style, line count, and structural constraints, beyond mere correctness. To address this, the paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs in controlled code generation, ensuring outputs align more closely with human-defined guidelines. The authors further present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages (Python, Java, JavaScript, TypeScript, Shell, C++, and C#), with each sample featuring both Chinese and English queries. Unlike existing benchmarks, IFEvalCode decouples evaluation into two…
