Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks
Eungyeup Kim, Chenchen Gu, Vashisth Tiwari, J. Zico Kolter

TL;DR
This paper introduces a sample-efficient method for evaluating extremely high reliability (five-nines) in large language models by focusing on failure-prone inputs, significantly reducing inference requirements.
Contribution
The authors propose a failure-focused sampling approach using the cross-entropy method to efficiently estimate rare failure probabilities in LLMs.
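To make the idea concrete, here is a minimal sketch of the textbook cross-entropy method for rare-event estimation on a toy problem. It is not the authors' implementation. A single Gaussian input parameter stands in for the parameterized input space, and the test `x >= threshold` stands in for "the model fails on this input". The function name `cem_rare_event` and all of its parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist


def cem_rare_event(threshold, n_per_round=2000, rho=0.1, max_rounds=20, seed=0):
    """Estimate P(X >= threshold) for X ~ N(0, 1) with the cross-entropy method.

    Toy assumption: "x >= threshold" plays the role of a model failure.
    The proposal is N(mu, 1); only its mean mu is adapted.
    """
    rng = random.Random(seed)
    mu = 0.0  # start from the nominal input distribution N(0, 1)

    for _ in range(max_rounds):
        xs = sorted(rng.gauss(mu, 1.0) for _ in range(n_per_round))
        # Intermediate level: the (1 - rho) sample quantile, capped at the target.
        level = min(xs[int((1.0 - rho) * n_per_round)], threshold)
        elite = [x for x in xs if x >= level]
        # Likelihood ratio nominal/proposal: N(x; 0, 1) / N(x; mu, 1).
        weights = [math.exp(-mu * x + 0.5 * mu * mu) for x in elite]
        # Cross-entropy update: weighted mean of the elite samples.
        mu = sum(w * x for w, x in zip(weights, elite)) / sum(weights)
        if level >= threshold:
            break

    # Final importance-sampling estimate under the tuned proposal.
    total = 0.0
    for _ in range(n_per_round):
        x = rng.gauss(mu, 1.0)
        if x >= threshold:
            total += math.exp(-mu * x + 0.5 * mu * mu)
    return total / n_per_round, mu


# Threshold chosen so that the true failure probability is 1e-5 ("five-nines").
THRESHOLD = NormalDist().inv_cdf(1.0 - 1e-5)
estimate, tuned_mu = cem_rare_event(THRESHOLD)
print(f"estimated failure probability: {estimate:.2e} (proposal mean {tuned_mu:.2f})")
```

The proposal drifts toward the failure region over a few rounds, so most of the final samples land on failures and are reweighted back to the nominal distribution. With these toy settings each round costs 2,000 simulated evaluations, whereas plain Monte Carlo would see a failure only about once per 100,000 draws.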
Findings
Achieves up to a 156.22x reduction in the number of inferences needed for failure estimation.
Models with similar benchmark accuracy can have vastly different failure rates.
The framework enables reliable evaluation of LLMs for safety-critical applications.
Abstract
While existing benchmarks demonstrate the near-perfect performance of large language models (LLMs) on various tasks, this apparent saturation often obscures the need for rigorous evaluation of their reliability. In real-world deployment, however, achieving extremely high reliability (e.g., "five-nines" (99.999%) vs. "three-nines" (99.9%)) is fundamentally critical, as this gap results in a two-orders-of-magnitude (100x) increase in failures, which is catastrophic in reliability-critical applications. Still, estimating such a rare failure probability with tight confidence bounds requires a prohibitively large number of LLM inferences, making standard Monte Carlo evaluation infeasible under limited compute budgets. In this paper, we observe that LLM failures exhibit strong systematic patterns: across broad parameterized input spaces, a small subset of inputs disproportionately accounts for the majority of…
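As a rough illustration of why standard Monte Carlo becomes expensive at these reliability levels, the sketch below computes how many independent trials with zero observed failures are needed before a one-sided upper confidence bound on the failure rate drops below a target. It uses the standard zero-failure binomial bound, (1 - p)^n <= 1 - confidence. The function name `samples_for_zero_failure_bound` and the 95% confidence level are assumptions for illustration, not figures from the paper.

```python
import math


def samples_for_zero_failure_bound(p_max, confidence=0.95):
    """Smallest n such that 0 failures in n i.i.d. trials certifies
    failure rate <= p_max at the given one-sided confidence level."""
    # Solve (1 - p_max)^n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_max))


n_three_nines = samples_for_zero_failure_bound(1e-3)  # 99.9% reliability
n_five_nines = samples_for_zero_failure_bound(1e-5)   # 99.999% reliability

print("three-nines:", n_three_nines)
print("five-nines: ", n_five_nines)
print("ratio:      ", round(n_five_nines / n_three_nines, 1))
```

The required sample count scales roughly as 3 / p_max, so moving from three-nines to five-nines multiplies the inference budget by about 100 even in the best case where no failure is ever observed.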
