The Instruction Gap: LLMs get lost in Following Instruction
Vishesh Tripathi, Uday Allu, Biddwan Ahmed

TL;DR
This paper evaluates 13 leading LLMs and finds significant variability in instruction adherence. It identifies a critical "instruction gap" that affects enterprise deployment and provides benchmarks for future improvement.
Contribution
It systematically assesses instruction compliance across models, identifying the extent of the instruction gap and establishing benchmarks for enterprise-ready LLM performance.
Findings
Claude-Sonnet-4 and GPT-5 perform best in instruction following
Instruction adherence varies dramatically across models
The instruction gap poses a challenge for enterprise deployment
Abstract
Large Language Models (LLMs) have shown remarkable capabilities in natural language understanding and generation, yet their deployment in enterprise environments reveals a critical limitation: inconsistent adherence to custom instructions. This study presents a comprehensive evaluation of 13 leading LLMs across instruction compliance, response accuracy, and performance metrics in real-world RAG (Retrieval-Augmented Generation) scenarios. Through systematic testing with samples and enterprise-grade evaluation protocols, we demonstrate that instruction following varies dramatically across models, with Claude-Sonnet-4 and GPT-5 achieving the highest results. Our findings reveal the "instruction gap" - a fundamental challenge where models excel at general tasks but struggle with precise instruction adherence required for enterprise deployment. This work provides practical insights for…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Materials Science
