ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, Gabriel Stanovsky

TL;DR
ReliableEval introduces a stochastic evaluation method for LLMs that accounts for prompt sensitivity, providing more robust and meaningful performance assessments across models, tasks, and metrics.
Contribution
The paper proposes a formal framework and a practical method for stochastic LLM evaluation that improves reliability by considering prompt variability.
Findings
Top models like GPT-4o and Claude-3.7-Sonnet show significant prompt sensitivity.
ReliableEval estimates the number of prompt resamplings needed for stable evaluation.
The approach is model-, task-, and metric-agnostic, enhancing evaluation robustness.
Abstract
LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method-of-moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and suggest ReliableEval, a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.
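The idea of estimating moments over prompt perturbations, and of deciding how many resamplings are enough, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `epsilon` and `window` parameters, and the stopping rule (stop once the running mean changes by less than `epsilon` for `window` consecutive samples) are all assumptions made for illustration. The abstract does not specify the paper's actual procedure.

```python
import random
import statistics


def moment_estimates(scores):
    """Return the first two moments (mean, population variance) of per-prompt scores."""
    mean = statistics.fmean(scores)
    variance = statistics.pvariance(scores, mu=mean)
    return mean, variance


def resamplings_needed(score_fn, paraphrases, epsilon=0.01, window=3, seed=0):
    """Hypothetical stopping rule, not taken from the paper.

    Evaluates paraphrases in random order and stops once the running mean
    score has moved by less than `epsilon` for `window` consecutive samples.
    Returns (number of paraphrases evaluated, running mean at that point).
    """
    if not paraphrases:
        raise ValueError("at least one paraphrase is required")

    rng = random.Random(seed)
    order = list(paraphrases)
    rng.shuffle(order)

    scores = []
    previous_mean = None
    stable_steps = 0
    for n, prompt in enumerate(order, start=1):
        # score_fn is assumed to run the model on one prompt variant
        # and return a scalar metric (e.g. accuracy on the benchmark).
        scores.append(score_fn(prompt))
        mean = statistics.fmean(scores)
        if previous_mean is not None and abs(mean - previous_mean) < epsilon:
            stable_steps += 1
        else:
            stable_steps = 0
        previous_mean = mean
        if stable_steps >= window:
            return n, mean

    # The estimate never stabilized: all available paraphrases were used.
    return len(order), statistics.fmean(scores)
```

In this sketch, `score_fn` stands in for whatever model, task, and metric are being evaluated, which mirrors the agnostic framing in the abstract. A prompt-insensitive model stabilizes after a handful of paraphrases, while a sensitive one consumes many more before the running mean settles.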
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms
