CIFE: Code Instruction-Following Evaluation

Sravani Gunnu; Shanmukha Guttula; Hima Patel

arXiv:2512.17387·cs.SE·December 22, 2025

Open Access

TL;DR

This paper introduces CIFE, a benchmark of 1,000 Python tasks paired with developer-specified constraints, to evaluate how well language models generate code that is correct and also adheres to explicit requirements.

Contribution

The paper presents a new benchmark and metrics for assessing code generation models on correctness and constraint adherence, highlighting the gap between partial and strict compliance.

Findings

01

Strong models achieve over 90% partial adherence

02

Strict adherence ranges from 39% to 66%

03

Significant gap between partial and strict constraint satisfaction
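To make the gap between the two adherence figures concrete, here is a minimal sketch of how partial and strict adherence could be computed. The definitions are assumptions, not taken from the paper: partial adherence is read as the mean fraction of constraints satisfied per task, and strict adherence as the fraction of tasks where every constraint is satisfied. The function names and the sample data are hypothetical.

```python
from typing import Sequence


def partial_adherence(results: Sequence[Sequence[bool]]) -> float:
    """Mean per-task fraction of satisfied constraints (assumed definition)."""
    per_task = [sum(r) / len(r) for r in results if r]
    return sum(per_task) / len(per_task)


def strict_adherence(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks with all constraints satisfied (assumed definition)."""
    non_empty = [r for r in results if r]
    return sum(all(r) for r in non_empty) / len(non_empty)


# Hypothetical outcomes: 4 tasks with 7 constraints each,
# two of which miss exactly one constraint.
results = [
    [True] * 7,
    [True] * 6 + [False],
    [True] * 6 + [False],
    [True] * 7,
]

partial = partial_adherence(results)  # 26/28, about 0.93
strict = strict_adherence(results)    # 2/4 = 0.5
print(f"partial={partial:.2f} strict={strict:.2f}")
```

Under these assumed definitions, a single missed constraint barely moves the partial score but removes the whole task from the strict score, which is one way a model can exceed 90% partial adherence while its strict adherence stays far lower.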

Abstract

Large Language Models (LLMs) are increasingly applied to real-world code generation, where functional correctness alone is insufficient for reliable deployment: developers also expect adherence to explicit requirements for robustness, formatting, and security. Existing benchmarks primarily assess correctness through test-case execution, offering limited insight into how reliably models follow such constraints. We introduce a benchmark of 1,000 Python tasks, each paired with an average of 7 developer-specified constraints spanning 13 categories. Constraints are curated through a four-stage human-LLM pipeline to ensure they are atomic, relevant, and objective. We evaluate 14 open- and closed-source models using complementary adherence metrics and propose the C2A Score, a composite measure that jointly captures correctness and constraint compliance. Results reveal a substantial gap between…

Taxonomy

Topics: Software Engineering Research · Scientific Computing and Data Management · Adversarial Robustness in Machine Learning