When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs
Richard J. Young, Brandon Gillins, Alice M. Matthews

TL;DR
This paper introduces a practical, efficient framework for evaluating instruction-following and applies it to 256 large language models, revealing common failure modes and providing a broad empirical analysis of their capabilities.
Contribution
It presents a novel, streamlined testing approach that uses twenty prompts to diagnose instruction adherence across diverse models, offering a lighter-weight alternative to resource-intensive benchmarks.
Findings
Identified prevalent failure modes in instruction following
Compared performance across major and emerging LLM providers
Provided insights into specific instruction challenges
Abstract
Despite widespread deployment of Large Language Models, systematic evaluation of instruction-following capabilities remains challenging. While comprehensive benchmarks exist, focused assessments that quickly diagnose specific instruction adherence patterns are valuable. As newer models may be trained on existing benchmarks, novel evaluation approaches are needed to assess genuine capabilities rather than memorized performance. This paper presents a streamlined evaluation framework using twenty carefully designed prompts to assess LLM instruction-following across diverse task categories. We demonstrate this framework through a large-scale empirical study conducted on October 14, 2025, testing 256 verified working models from 331 available via OpenRouter. To ensure methodological rigor and prevent selection bias, we first verified each model's basic functionality before inclusion. Unlike…
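The abstract describes a two-stage procedure: first verify that each model responds at all, then run a fixed prompt set against the models that pass (256 of 331, roughly 77%). The following is a minimal sketch of that flow, not the authors' implementation. The test prompts, the checker functions, the `is_working` probe, and the stub models are all hypothetical stand-ins; a real run would replace the stubs with calls to an API such as OpenRouter and use the paper's twenty prompts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "model" here is any callable mapping a prompt to a response string.
Model = Callable[[str], str]


@dataclass
class InstructionTest:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the response obeys the instruction


# Hypothetical tests; the paper's twenty prompts are not reproduced here.
TESTS: List[InstructionTest] = [
    InstructionTest(
        "uppercase_only",
        "Reply with the word 'hello' in uppercase and nothing else.",
        lambda r: r.strip() == "HELLO",
    ),
    InstructionTest(
        "three_lines",
        "List three colors, one per line, with no other text.",
        lambda r: len([ln for ln in r.strip().splitlines() if ln.strip()]) == 3,
    ),
]


def is_working(model: Model) -> bool:
    """Stage 1: keep a model only if it returns a non-empty reply."""
    try:
        return bool(model("Say OK.").strip())
    except Exception:
        return False


def evaluate(models: Dict[str, Model]) -> Dict[str, float]:
    """Stage 2: pass rate per verified model; unverified models are skipped."""
    scores: Dict[str, float] = {}
    for name, model in models.items():
        if not is_working(model):
            continue
        passed = sum(test.check(model(test.prompt)) for test in TESTS)
        scores[name] = passed / len(TESTS)
    return scores


# Stub models standing in for real API calls (assumptions for illustration).
def good_model(prompt: str) -> str:
    if "uppercase" in prompt:
        return "HELLO"
    if "colors" in prompt:
        return "red\ngreen\nblue"
    return "OK"


def chatty_model(prompt: str) -> str:
    # Answers, but wraps everything in extra text and ignores the format.
    return "Sure! Here is my answer: hello"


def broken_model(prompt: str) -> str:
    raise RuntimeError("endpoint unavailable")


MODELS: Dict[str, Model] = {
    "good": good_model,
    "chatty": chatty_model,
    "broken": broken_model,
}

scores = evaluate(MODELS)
print(scores)
```

Separating the functionality probe from the scoring step keeps unreachable endpoints out of the denominator, so a model that never answers is excluded rather than counted as failing every instruction.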
Code & Models
- 🤗 richardyoung/llm-instruction-following-code · model
- 🤗 richardyoung/llm-instruction-following-paper · model
- 🤗 richardyoung/olmOCR-2-7B-1025-GGUF · model · 339 dl · ♡ 3
- 🤗 richardyoung/Deepseek-R1-Distill-Qwen-32b-uncensored · model · 371 dl · ♡ 4
- 🤗 richardyoung/Qwen2.5-7B-Instruct-abliterated-GGUF · model · 7.4k dl
- 🤗 richardyoung/Qwen3-14B-abliterated-GGUF · model · 805 dl
- 🤗 Wlc7758/Deepseek-R1-Distill-Qwen-32b-uncensored · model · 818 dl
Taxonomy
Topics: Software Engineering Research · Natural Language Processing Techniques · Text Readability and Simplification
