Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation
Wenbo Zhang, Hengrui Cai, Wenyu Chen

TL;DR
This paper introduces a hierarchical statistical model that leverages multiple generations from large language models to improve benchmark evaluation accuracy, reduce variance, and provide detailed prompt-level insights.
Contribution
It presents a novel statistical framework that accounts for LLM randomness and enhances benchmark reliability through multiple generations and detailed prompt analysis.
Findings
Using multiple generations improves benchmark score estimation accuracy.
The model enables prompt-level difficulty scoring based on each prompt's correct ratio, the fraction of its generations that are correct.
A data map visualizes prompt difficulty and semantics for error detection.
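The first two findings can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each prompt has a hypothetical latent probability of being answered correctly, and that each of k generations for that prompt is graded as 0 or 1. The function names, the Beta(2, 2) prior in `simulate`, and the data layout are illustrative assumptions, not taken from the paper.

```python
import random


def correct_ratios(outcomes):
    """Per-prompt fraction of correct generations.

    outcomes: list of lists, one inner list of 0/1 grades per prompt.
    A low ratio suggests a hard prompt; a high ratio suggests an easy one.
    """
    return [sum(row) / len(row) for row in outcomes]


def benchmark_score(outcomes):
    """Benchmark score estimate: the mean of the per-prompt correct ratios."""
    ratios = correct_ratios(outcomes)
    return sum(ratios) / len(ratios)


def simulate(n_prompts, k, rng):
    """Simulate graded generations under an assumed two-level model.

    Each prompt draws a latent success probability (hypothetical Beta(2, 2)),
    then k independent 0/1 grades are drawn with that probability.
    """
    outcomes = []
    for _ in range(n_prompts):
        p = rng.betavariate(2, 2)  # assumed prior, for illustration only
        outcomes.append([1 if rng.random() < p else 0 for _ in range(k)])
    return outcomes


if __name__ == "__main__":
    rng = random.Random(0)
    single = simulate(n_prompts=200, k=1, rng=rng)
    multiple = simulate(n_prompts=200, k=20, rng=rng)
    print("score with k=1: ", benchmark_score(single))
    print("score with k=20:", benchmark_score(multiple))
```

With k = 1 every correct ratio is either 0 or 1, so prompts cannot be ranked by difficulty. With larger k the ratios take intermediate values, which is what makes prompt-level scoring possible.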
Abstract
Large language models (LLMs) have demonstrated significant utility in real-world applications, exhibiting impressive capabilities in natural language processing and understanding. Benchmark evaluations are crucial for assessing the capabilities of LLMs as they can provide a comprehensive assessment of their strengths and weaknesses. However, current evaluation methods often overlook the inherent randomness of LLMs by employing deterministic generation strategies or relying on a single random sample, resulting in unaccounted sampling variance and unreliable benchmark score estimates. In this paper, we propose a hierarchical statistical model that provides a more comprehensive representation of the benchmarking process by incorporating both benchmark characteristics and LLM randomness. We show that leveraging multiple generations improves the accuracy of estimating the benchmark score and…
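The variance reduction from multiple generations can be made concrete with a short derivation. The abstract above does not give the exact form of the paper's model, so this is a sketch under an assumed two-level setup: n prompts with latent correctness probabilities p_i drawn independently with mean \mu and variance \sigma^2, and k generations per prompt with grades y_{ij} that are Bernoulli(p_i) given p_i. The symbols n, k, p_i, y_{ij}, \mu, and \sigma^2 are notation introduced here for illustration.

```latex
\hat{\mu} = \frac{1}{nk} \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij},
\qquad
\operatorname{Var}(\hat{\mu})
  = \frac{1}{n} \left[ \sigma^2 + \frac{\mathbb{E}\left[ p_i (1 - p_i) \right]}{k} \right]
  = \frac{1}{n} \left[ \sigma^2 + \frac{\mu (1 - \mu) - \sigma^2}{k} \right].
```

The first term reflects variation across prompts and shrinks only as n grows. The second term reflects sampling randomness within a prompt and shrinks as k grows. With k = 1 the expression reduces to \mu (1 - \mu) / n, so a single sample per prompt carries the full within-prompt variance.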
