Deconstructing Instruction-Following: A New Benchmark for Granular Evaluation of Large Language Model Instruction Compliance Abilities

Alberto Purpura; Li Wang; Sahil Badyal; Eugenio Beaufrand; Adam Faulkner

arXiv:2601.18554 · cs.AI · January 27, 2026

TL;DR

This paper introduces MOSAIC, a modular benchmark for detailed evaluation of LLMs' ability to follow complex instructions, revealing variability in compliance across different constraints and model types.

Contribution

The paper presents MOSAIC, a novel framework for granular, independent assessment of instruction compliance in LLMs, addressing limitations of existing benchmarks.

Findings

01

Compliance varies with constraint type, quantity, and position.

02

Models show specific weaknesses and positional biases, such as primacy and recency effects.

03

Interactions between instructions can be synergistic or conflicting.

Abstract

Reliably ensuring Large Language Models (LLMs) follow complex instructions is a critical challenge, as existing benchmarks often fail to reflect real-world use or isolate compliance from task success. We introduce MOSAIC (MOdular Synthetic Assessment of Instruction Compliance), a modular framework that uses a dynamically generated dataset with up to 20 application-oriented generation constraints to enable a granular and independent analysis of this capability. Our evaluation of five LLMs from different families based on this new benchmark demonstrates that compliance is not a monolithic capability but varies significantly with constraint type, quantity, and position. The analysis reveals model-specific weaknesses, uncovers synergistic and conflicting interactions between instructions, and identifies distinct positional biases such as primacy and recency effects. These granular insights…
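
To make the idea of granular, per-constraint analysis concrete, the following is a minimal sketch of how compliance could be aggregated by constraint type and position in the prompt. It is not the MOSAIC implementation: the function name `compliance_by_position`, the record schema, and the toy judgments are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def compliance_by_position(records):
    """Aggregate pass/fail judgments per (constraint type, position).

    `records` is an iterable of (constraint_type, position, complied)
    tuples. This schema is an assumption, not MOSAIC's data format.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [passed, seen]
    for constraint_type, position, complied in records:
        cell = totals[(constraint_type, position)]
        cell[0] += int(complied)
        cell[1] += 1
    # Convert counts into a compliance rate for each cell.
    return {key: passed / seen for key, (passed, seen) in totals.items()}


# Toy judgments, invented purely for illustration (not results from the paper).
records = [
    ("length", 0, True), ("length", 0, True),
    ("length", 2, False), ("length", 2, True),
    ("format", 0, True), ("format", 2, False),
]
rates = compliance_by_position(records)
```

Keeping one rate per (constraint type, position) cell, rather than a single overall score, is what would let effects such as primacy or recency bias show up as differences between positions for the same constraint.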

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Intelligent Tutoring Systems and Adaptive Learning