TL;DR
This paper evaluates the nuance-oriented reliability of large language models (LLMs), finds significant performance drops under subtle prompt changes, and proposes a new metric (reliable@k), a benchmark (IFEval++), and improvement strategies.
Contribution
It introduces the reliable@k metric, develops the IFEval++ benchmark, and provides a systematic evaluation of LLMs' nuance-oriented reliability, highlighting a key area for future improvement.
Findings
Model performance can drop by up to 61.8% under nuanced prompt changes.
Current models fall substantially short on nuance-oriented reliability.
Code and benchmark are available on GitHub.
Abstract
Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability -- their performance can drop by up to 61.8% with nuanced prompt…
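The abstract names reliable@k but this excerpt does not define it. As a minimal sketch, assume one plausible reading: a prompt group counts as reliable only if the model passes all of the first k prompts in that group (the original plus its cousin prompts). The function name `reliable_at_k`, the input layout, and the sample outcomes below are hypothetical illustrations, not the authors' implementation or data.

```python
from typing import Sequence


def reliable_at_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of prompt groups whose first k prompts all pass.

    ASSUMPTION: this all-must-pass reading of reliable@k is an
    illustration only; the paper's exact definition may differ.
    Each inner sequence holds pass/fail outcomes for one original
    prompt followed by its cousin prompts.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not results:
        raise ValueError("results must contain at least one prompt group")
    reliable_groups = 0
    for group in results:
        if len(group) < k:
            raise ValueError("each group needs at least k outcomes")
        # A single failure among the first k prompts makes the group unreliable.
        if all(group[:k]):
            reliable_groups += 1
    return reliable_groups / len(results)


# Hypothetical outcomes: 4 prompt groups, each with 3 prompts
# (index 0 is the original prompt, the rest are cousin prompts).
outcomes = [
    [True, True, True],
    [True, False, True],
    [True, True, False],
    [False, False, False],
]

plain_accuracy = reliable_at_k(outcomes, 1)  # original prompts only
strict_score = reliable_at_k(outcomes, 3)    # all three prompts must pass
```

Under this reading, k = 1 reduces to ordinary accuracy on the original prompts, and the score can only stay flat or fall as k grows, which is one way a near-ceiling benchmark score could coexist with a large reliability gap.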
