PACIFIC: a framework for generating benchmarks to check Precise Automatically Checked Instruction Following In Code

Itay Dreyfuss; Antonio Abu Nassar; Samuel Ackerman; Axel Ben David; Eitan Farchi; Rami Katan; Orna Raz; Marcel Zalmanovici

arXiv:2512.10713 · cs.SE · December 23, 2025

Open Access

TL;DR

PACIFIC is a framework that automatically creates customizable benchmarks to evaluate LLMs' ability to follow instructions and reason about code without execution, addressing limitations of existing evaluation methods.

Contribution

The paper introduces a novel, scalable framework for generating diverse, contamination-resistant benchmarks that assess instruction following and code dry-running in LLMs.

Findings

01. PACIFIC can generate benchmarks with varying difficulty levels.

02. It effectively differentiates LLM capabilities across models.

03. Benchmarks show models' instruction-following and reasoning skills.

Abstract

Large Language Model (LLM)-based code assistants have emerged as a powerful application of generative AI, demonstrating impressive capabilities in code generation and comprehension. A key requirement for these systems is their ability to accurately follow user instructions. We present Precise Automatically Checked Instruction Following In Code (PACIFIC), a novel framework designed to automatically generate benchmarks that rigorously assess sequential instruction-following and code dry-running capabilities in LLMs, while allowing control over benchmark difficulty. PACIFIC produces benchmark variants with clearly defined expected outputs, enabling straightforward and reliable evaluation through simple output comparisons. In contrast to existing approaches that often rely on tool usage or agentic behavior, our work isolates and evaluates the LLM's intrinsic ability to reason through code…
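The evaluation idea described in the abstract (a sequence of instructions, a clearly defined expected output, and scoring by simple output comparison) can be illustrated with a minimal sketch. This is not the authors' implementation: the instruction pool, the function names `generate_sample` and `score`, the default input string, and the use of the number of chained instructions as the difficulty control are all assumptions made for illustration.

```python
import random

# Hypothetical instruction pool (an assumption, not taken from the paper):
# each entry pairs a natural-language instruction with a reference
# implementation that is used only to compute the expected output.
INSTRUCTIONS = [
    ("Reverse the string.", lambda s: s[::-1]),
    ("Convert the string to upper case.", lambda s: s.upper()),
    ("Remove every occurrence of the letter 'a'.", lambda s: s.replace("a", "")),
    ("Append the length of the string to its end.", lambda s: s + str(len(s))),
]


def generate_sample(seed, num_instructions, initial_input="banana"):
    """Build one benchmark sample; num_instructions is the assumed difficulty knob."""
    rng = random.Random(seed)  # seeded so the same sample can be regenerated
    chosen = [rng.choice(INSTRUCTIONS) for _ in range(num_instructions)]
    value = initial_input
    for _, apply in chosen:
        value = apply(value)  # the output of one step feeds the next
    steps = [f"{i}. {text}" for i, (text, _) in enumerate(chosen, start=1)]
    prompt = f"Input: {initial_input}\n" + "\n".join(steps)
    return {"prompt": prompt, "expected": value}


def score(samples, responses):
    """Fraction of model responses that exactly match the expected output."""
    hits = sum(r.strip() == s["expected"] for s, r in zip(samples, responses))
    return hits / len(samples)
```

In this sketch the model under test would receive only `prompt` and would have to work out the final value without executing anything, while the checker needs nothing beyond a string comparison. Drawing fresh samples from new seeds is one plausible way such a generator could limit contamination from previously published test items.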

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Machine Learning in Materials Science