When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs

Richard J. Young; Brandon Gillins; Alice M. Matthews

arXiv:2510.18892·cs.CL·October 23, 2025

Open Access · 7 Models · 2 Datasets

TL;DR

This paper introduces a compact evaluation framework for instruction following and applies it to 256 large language models, reporting common failure modes and an empirical comparison of their capabilities.

Contribution

The paper presents a streamlined testing approach that uses twenty prompts to diagnose instruction adherence across diverse models, offering a lighter-weight complement to resource-intensive benchmarks.

Findings

01. Identified prevalent failure modes in instruction following

02. Compared performance across major and emerging LLM providers

03. Provided insights into specific instruction challenges

Abstract

Despite widespread deployment of Large Language Models, systematic evaluation of instruction-following capabilities remains challenging. While comprehensive benchmarks exist, focused assessments that quickly diagnose specific instruction adherence patterns are valuable. As newer models may be trained on existing benchmarks, novel evaluation approaches are needed to assess genuine capabilities rather than memorized performance. This paper presents a streamlined evaluation framework using twenty carefully designed prompts to assess LLM instruction-following across diverse task categories. We demonstrate this framework through a large-scale empirical study conducted on October 14, 2025, testing 256 verified working models from 331 available via OpenRouter. To ensure methodological rigor and prevent selection bias, we first verified each model's basic functionality before inclusion. Unlike…
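The verify-then-evaluate procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the abstract does not show the twenty prompts or their scoring rules, so the prompts, the checks, the stub models, and the names `is_working` and `evaluate` below are all hypothetical. A real run would replace the stubs with calls to a hosted model API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a "model" is any callable mapping a prompt to a reply,
# and a "check" decides whether a reply followed the instruction.
Model = Callable[[str], str]
Check = Callable[[str], bool]


def is_working(model: Model) -> bool:
    """Basic functionality check: the model answers with non-empty text."""
    try:
        reply = model("Reply with the word OK.")
    except Exception:
        return False
    return isinstance(reply, str) and reply.strip() != ""


def evaluate(models: Dict[str, Model],
             tests: List[Tuple[str, Check]]) -> Dict[str, float]:
    """Return the pass rate per working model; broken models are excluded."""
    results: Dict[str, float] = {}
    for name, model in models.items():
        if not is_working(model):
            continue  # filtered out before scoring, as in the verification step
        passed = 0
        for prompt, check in tests:
            try:
                if check(model(prompt)):
                    passed += 1
            except Exception:
                pass  # an error during a test counts as a failure
        results[name] = passed / len(tests)
    return results


# Illustrative instruction checks (assumptions, not the paper's prompts).
tests: List[Tuple[str, Check]] = [
    ("Answer with exactly one word: yes or no.",
     lambda r: r.strip().lower() in {"yes", "no"}),
    ("Reply in all uppercase.", lambda r: r.isupper()),
]


def broken(prompt: str) -> str:
    raise RuntimeError("model unavailable")


# Stub models standing in for real endpoints.
models: Dict[str, Model] = {
    "obedient": lambda p: "YES",
    "chatty": lambda p: "Sure, here is my answer: yes",
    "broken": broken,
}

results = evaluate(models, tests)
coverage = 256 / 331  # share of available models that passed verification
```

With these stubs, the broken model is dropped before scoring, which mirrors how the study reports results for 256 of the 331 available models (about 77%).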

Taxonomy

Topics: Software Engineering Research · Natural Language Processing Techniques · Text Readability and Simplification