A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback
Guoliang Duan, Mingwei Liu, Yanlin Wang, Chong Wang, Xin Peng, Zibin Zheng

TL;DR
This paper introduces MultiCodeIF, a detailed benchmark for evaluating large language models' ability to follow complex, multi-layered programming instructions with iterative feedback, highlighting current performance gaps and improvement potential.
Contribution
It presents a comprehensive, evolvable benchmark with a structured taxonomy, automated task synthesis, and multi-turn evaluation to assess instruction-following in code generation.
Findings
The top-performing model achieves 63.0% constraint satisfaction.
Performance drops significantly with multiple hierarchical constraints.
Structured feedback improves success rates from 63.0% to 83.4%.
Abstract
Large language models (LLMs) have advanced significantly in code generation, yet their ability to follow complex programming instructions with layered and diverse constraints remains underexplored. Existing benchmarks often prioritize functional correctness, overlooking the nuanced requirements found in real-world development. We introduce MultiCodeIF, a comprehensive benchmark designed to evaluate instruction-following in code generation across multiple dimensions: constraint type, hierarchical levels, and iterative refinement. Built upon a structured taxonomy of 9 categories and 27 constraint types, MultiCodeIF enables granular assessment of both functional and non-functional instruction adherence. Using an automated pipeline, ConstraGen, we synthesize and evolve 2,021 code tasks sourced from 14 programming languages, supporting multi-turn evaluation through feedback-driven task…
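To make the idea of feedback-driven, multi-turn evaluation concrete, here is a minimal sketch in Python. It is not the paper's ConstraGen pipeline or its evaluation harness. Every name in it (`Constraint`, `unmet`, `satisfaction_rate`, `refine`, `stub_model`) is hypothetical. The two string-matching checkers and the stub model are toy stand-ins, chosen only so the loop runs without an actual LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """Hypothetical constraint: a name plus a checker over generated code."""
    name: str
    check: Callable[[str], bool]


def unmet(code: str, constraints: List[Constraint]) -> List[str]:
    """Return the names of constraints the code fails."""
    return [c.name for c in constraints if not c.check(code)]


def satisfaction_rate(code: str, constraints: List[Constraint]) -> float:
    """Fraction of constraints satisfied (1.0 when there are none)."""
    if not constraints:
        return 1.0
    satisfied = len(constraints) - len(unmet(code, constraints))
    return satisfied / len(constraints)


def refine(generate: Callable[[str], str], prompt: str,
           constraints: List[Constraint], max_turns: int = 3) -> List[float]:
    """Run up to max_turns rounds, feeding unmet constraints back each turn.

    Returns the satisfaction rate observed at each turn.
    """
    history: List[float] = []
    feedback = ""
    for _ in range(max_turns):
        code = generate(prompt + feedback)
        history.append(satisfaction_rate(code, constraints))
        failed = unmet(code, constraints)
        if not failed:
            break
        feedback = "\nUnmet constraints: " + ", ".join(failed)
    return history


# Toy constraints (assumptions for illustration, not the paper's taxonomy).
CONSTRAINTS = [
    Constraint("snake_case_name", lambda code: "def add_numbers" in code),
    Constraint("has_docstring", lambda code: '"""' in code),
]


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM: adds a docstring only after being told to."""
    if "has_docstring" in prompt:
        return 'def add_numbers(a, b):\n    """Add two numbers."""\n    return a + b\n'
    return "def add_numbers(a, b):\n    return a + b\n"


history = refine(stub_model, "Write add_numbers.", CONSTRAINTS)
print(history)
```

With this stub, the first turn meets one of the two constraints and the second turn meets both, so the per-turn rate rises once the feedback names the failed constraint. A real harness would replace the string checks with static analysis or test execution.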
Taxonomy
Topics: Software Engineering Research · Parallel Computing and Optimization Techniques · Natural Language Processing Techniques
