Revisiting the Reliability of Language Models in Instruction-Following

Jianshuo Dong; Yutong Zhang; Yan Liu; Zhenyu Zhong; Tao Wei; Chao Zhang; Han Qiu

arXiv:2512.14754 · cs.SE · April 15, 2026

TL;DR

This paper evaluates the nuance-oriented reliability of large language models in instruction-following. It shows that performance drops significantly under subtle prompt changes, and it proposes a new metric (reliable@k), a benchmark (IFEval++), and improvement strategies.

Contribution

It introduces the reliable@k metric, develops the IFEval++ benchmark, and systematically evaluates the nuance-oriented reliability of LLMs, highlighting a key area for future improvement.

Findings

01

Model performance can drop by up to 61.8% on nuanced prompts.

02

Current models fall substantially short in nuance-oriented reliability.

03

Code and benchmark are available on GitHub.

Abstract

Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability -- their performance can drop by up to 61.8% with nuanced prompt…
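The abstract as shown here does not spell out how reliable@k is computed. The sketch below illustrates one plausible reading, labeled as an assumption: a group of cousin prompts counts as passed only if the model follows the instruction on every one of the first k prompts in the group. The function names, the data layout, and the toy outcomes are all hypothetical and are not taken from the paper or its repository. The sketch also treats the reported drop as a relative one, which this page does not confirm.

```python
from typing import Dict, List


def reliable_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of prompt groups whose first k prompts all pass.

    ASSUMPTION: this all-must-pass definition is an illustrative reading
    of reliable@k, not the paper's confirmed formula.
    `results` maps a (hypothetical) group id to per-prompt pass/fail
    outcomes, with the original prompt first and its cousins after it.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not results:
        raise ValueError("results must contain at least one prompt group")

    passed = 0
    for group_id, outcomes in results.items():
        if len(outcomes) < k:
            raise ValueError(f"group {group_id!r} has fewer than {k} prompts")
        # The group counts only if every one of the first k prompts passed.
        if all(outcomes[:k]):
            passed += 1
    return passed / len(results)


def relative_drop(baseline: float, reliable: float) -> float:
    """Relative decrease from a baseline score to a stricter score."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - reliable) / baseline


# Toy outcomes (fabricated for illustration only, not benchmark data).
toy_results = {
    "g1": [True, True, True],
    "g2": [True, False, True],
    "g3": [True, True, False],
    "g4": [False, True, True],
}

acc_at_1 = reliable_at_k(toy_results, 1)  # accuracy on the original prompts
acc_at_3 = reliable_at_k(toy_results, 3)  # all three prompts must pass
print(acc_at_1, acc_at_3, relative_drop(acc_at_1, acc_at_3))
```

Under this reading, reliable@1 reduces to ordinary accuracy on the original prompts, and the score can only stay flat or fall as k grows, which is why a model with a near-ceiling score on the original prompts can still score much lower once its cousins are included.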

Code & Models

Repositories

jianshuod/IFEval-pp (GitHub)
