ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

Gili Lior; Eliya Habba; Shahar Levy; Avi Caciularu; Gabriel Stanovsky

arXiv:2505.22169 · cs.CL · September 16, 2025

TL;DR

ReliableEval introduces a stochastic evaluation method for LLMs that accounts for prompt sensitivity, providing more robust and meaningful performance assessments across models, tasks, and metrics.

Contribution

The paper proposes a formal framework and a practical method for stochastic LLM evaluation that improves reliability by considering prompt variability.

Findings

1. Top models like GPT-4o and Claude-3.7-Sonnet show significant prompt sensitivity.
2. ReliableEval estimates the number of prompt resamplings needed for stable evaluation.
3. The approach is model-, task-, and metric-agnostic, enhancing evaluation robustness.

Abstract

LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method-of-moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and suggest ReliableEval, a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.
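
The idea of evaluating over resampled prompts can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the estimator or stopping rule, so the function names (`moment_estimates`, `resamplings_needed`), the normal-approximation rule, and the pilot scores below are all assumptions made for illustration. The sketch estimates the first two moments of a metric across paraphrased prompts, then uses the estimated variance to guess how many resamplings would bring the margin of error of the mean under a chosen tolerance.

```python
import math
import statistics


def moment_estimates(scores):
    """Return (mean, variance, standard_error) of per-prompt scores.

    `scores` holds one metric value per meaning-preserving prompt
    paraphrase (hypothetical input format, assumed for this sketch).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two resampled prompts")
    mean = statistics.fmean(scores)
    var = statistics.variance(scores)  # unbiased sample variance
    return mean, var, math.sqrt(var / n)


def resamplings_needed(pilot_scores, tolerance, z=1.96):
    """Smallest n with z * sqrt(var / n) <= tolerance.

    Assumed rule: a normal approximation using the variance estimated
    from a pilot sample. This is an illustrative stand-in, not the
    procedure defined in the paper.
    """
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    _, var, _ = moment_estimates(pilot_scores)
    return max(2, math.ceil(var * (z / tolerance) ** 2))


if __name__ == "__main__":
    # Made-up accuracies of one model under five paraphrases of a prompt.
    pilot = [0.70, 0.74, 0.66, 0.72, 0.68]
    mean, var, se = moment_estimates(pilot)
    print(f"mean={mean:.3f} variance={var:.4f} standard_error={se:.4f}")
    print("resamplings for +/-0.02:", resamplings_needed(pilot, 0.02))
```

Reporting the mean together with its standard error, rather than a single-prompt score, is what makes the spread caused by prompt phrasing visible. A tighter tolerance raises the required number of resamplings quadratically under this assumed rule.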

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms