What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models
Ranit Karmakar, Jayita Chatterjee

TL;DR
This paper shows that single-prompt accuracy can mislead: conclusions depend on evaluation design, confidence signals are fragile, and prompt robustness does not reliably track model size. It argues for reporting reliability metrics alongside accuracy.
Contribution
The paper audits language-model reliability across multiple prompt variants, showing how evaluation choices change measured performance and reliability.
Findings
Evaluation design significantly affects reliability conclusions.
Confidence signals are fragile and can misrepresent model certainty.
Prompt robustness does not correlate reliably with model size.
Abstract
Single-prompt accuracy is the dominant way to benchmark language models, but it can miss reliability failures that matter. We evaluate a 15-model open-weight corpus; the main reliability analyses focus on 10 instruct models across five classification and reasoning benchmarks, each under five prompt variants. For every (model × dataset × variant) cell we measure accuracy, token-probability calibration, verbal-confidence calibration, verbal parse rate, and prompt-perturbation spread. We find three broad results. First, evaluation design can materially change the conclusion. Switching token-probability Expected Calibration Error (ECE) from a raw to a label-set-normalised definition changes per-cell calibration by a mean absolute 0.149. More strikingly, pairing a chain-of-thought prompt with a first-character evaluator on ARC-Challenge reduces apparent accuracy by 72-88% across all five primary…
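To make the raw versus label-set-normalised distinction concrete, here is a minimal Python sketch. It rests on assumptions, since the excerpt does not give the paper's exact definitions: "raw" confidence is taken to be the full-vocabulary probability of the predicted label token, "normalised" confidence is that probability divided by the total mass on the label set, and ECE uses 10 equal-width bins. The function names and the toy probabilities are hypothetical and not taken from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: bin-weighted mean of |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Inner edges only, right-closed bins: yields bin indices 0 .. n_bins - 1.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the fraction of items in bin b
    return float(ece)

def raw_and_normalised_confidence(label_probs):
    """label_probs: (n_items, n_labels) full-vocabulary probability on each label token.

    Assumed definitions (not confirmed by the source):
      raw        = probability of the top label token
      normalised = raw / total probability mass on the label set
    """
    label_probs = np.asarray(label_probs, dtype=float)
    raw = label_probs.max(axis=1)
    normalised = raw / label_probs.sum(axis=1)
    return raw, normalised

# Toy, made-up probabilities for a 4-way multiple-choice task (illustration only).
toy_label_probs = np.array([
    [0.30, 0.10, 0.05, 0.05],
    [0.25, 0.15, 0.05, 0.05],
    [0.60, 0.10, 0.10, 0.10],
    [0.10, 0.40, 0.05, 0.05],
])
toy_gold = np.array([0, 1, 0, 1])

toy_correct = toy_label_probs.argmax(axis=1) == toy_gold
toy_raw, toy_norm = raw_and_normalised_confidence(toy_label_probs)
ece_raw = expected_calibration_error(toy_raw, toy_correct)
ece_norm = expected_calibration_error(toy_norm, toy_correct)
print(f"ECE (raw): {ece_raw:.3f}  ECE (label-set-normalised): {ece_norm:.3f}")
```

The predictions and their correctness are identical under both definitions; only the confidence values change, which is enough to move the ECE. This is why the choice of definition should be stated whenever a calibration number is reported.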
