HiBayES: A Hierarchical Bayesian Modeling Framework for AI Evaluation Statistics

Lennart Luettgau; Harry Coppock; Magda Dubois; Christopher Summerfield; Cozmin Ududec

arXiv:2505.05602·cs.AI·July 15, 2025

TL;DR

HiBayES is a hierarchical Bayesian framework for robustly evaluating AI systems. It is especially effective in low-data scenarios, provides principled uncertainty quantification, and adapts to complex, nested evaluation structures.

Contribution

The paper introduces a generalizable hierarchical Bayesian modeling framework for AI evaluation that supports robust inference and uncertainty quantification in complex, low-data evaluation settings.

Findings

01 Supports robust inferences in classical question-answer benchmarks and advanced agentic evaluations

02 Effective in low-data scenarios (e.g., fewer than 20 data points per evaluation)

03 Provides a software package for practical implementation

Abstract

As Large Language Models (LLMs) and other AI systems evolve, robustly estimating their capabilities from inherently stochastic outputs while systematically quantifying uncertainty in these estimates becomes increasingly important. Further, advanced AI evaluations often have a nested hierarchical structure, exhibit high levels of complexity, and come with high costs in testing the most advanced AI systems. To address these challenges, we introduce HiBayES, a generalizable Hierarchical Bayesian modeling framework for AI Evaluation Statistics. HiBayES supports robust inferences in classical question-answer benchmarks and advanced agentic evaluations, particularly in low-data scenarios (e.g., < 20 data points per evaluation). Built on Generalized Linear Models (GLMs), Bayesian data analysis, and formal model comparison, HiBayES provides principled uncertainty quantification and robust…
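
One way to picture the partial pooling that a hierarchical model provides in low-data settings is the minimal sketch below. It is not the HiBayES package API and not the GLM-based models from the paper: it swaps in a simpler hierarchical Beta-Binomial model evaluated on a grid, and the function name `pooled_estimates`, the grid sizes, and the flat prior over grid points are all assumptions made for illustration.

```python
import math


def log_beta(a, b):
    """Log of the Beta function, computed via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def pooled_estimates(successes, trials, n_mu=99, kappas=(1, 2, 5, 10, 20, 50, 100)):
    """Posterior mean accuracy per task under a hierarchical Beta-Binomial model.

    Assumed model (illustrative only):
        theta_j ~ Beta(mu * kappa, (1 - mu) * kappa)   # task-level accuracy
        y_j     ~ Binomial(n_j, theta_j)               # observed successes
    with a flat prior over a grid of (mu, kappa) values.
    """
    grid = []
    for i in range(1, n_mu + 1):
        mu = i / (n_mu + 1)  # group-level mean accuracy, strictly inside (0, 1)
        for kappa in kappas:  # concentration: larger means stronger pooling
            a, b = mu * kappa, (1 - mu) * kappa
            # Log marginal likelihood of all tasks given (a, b).
            # Binomial coefficients are dropped: they are constant across the grid.
            ll = sum(
                log_beta(y + a, n - y + b) - log_beta(a, b)
                for y, n in zip(successes, trials)
            )
            grid.append((a, b, ll))

    # Normalise the grid weights, subtracting the max for numerical stability.
    max_ll = max(ll for _, _, ll in grid)
    weights = [math.exp(ll - max_ll) for _, _, ll in grid]
    total = sum(weights)

    estimates = []
    for y, n in zip(successes, trials):
        # Conditional posterior mean of theta_j is (y + a) / (n + a + b);
        # average it over the posterior of the hyperparameters.
        est = sum(
            w * (y + a) / (n + a + b) for (a, b, _), w in zip(grid, weights)
        ) / total
        estimates.append(est)
    return estimates


if __name__ == "__main__":
    # Three hypothetical tasks with 10 trials each.
    successes, trials = [9, 5, 10], [10, 10, 10]
    for y, n, est in zip(successes, trials, pooled_estimates(successes, trials)):
        print(f"raw={y / n:.2f}  pooled={est:.3f}")
```

With equal trial counts, the pooled estimates keep the ordering of the raw accuracies while pulling an extreme value such as a perfect 10/10 below 1.0, toward the group-level mean. This shrinkage is what makes hierarchical estimates more stable when each evaluation has only a handful of data points.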

Code & Models

Repositories

ukgovernmentbeis/hibayes (Official)

Taxonomy

Topics: Explainable Artificial Intelligence (XAI)