Bayesian Evaluation of Large Language Model Behavior

Rachel Longjohn; Shang Wu; Saatvik Kher; Catarina Belém; Padhraic Smyth

arXiv:2511.10661·cs.CL·November 17, 2025

Open Access

TL;DR

This paper introduces a Bayesian method for quantifying statistical uncertainty in binary evaluation metrics of large language model behavior, a step that existing evaluation approaches often neglect.

Contribution

It presents a Bayesian framework for uncertainty quantification in LLM evaluation metrics, with case studies on harmful response refusal rates and preference comparisons.

Findings

01

Bayesian approach effectively quantifies uncertainty in LLM evaluations.

02

Uncertainty estimates improve understanding of model behavior.

03

Method applied successfully to adversarial and preference benchmarks.

Abstract

It is increasingly important to evaluate how text generation systems based on large language models (LLMs) behave, such as their tendency to produce harmful output or their sensitivity to adversarial inputs. Such evaluations often rely on a curated benchmark set of input prompts provided to the LLM, where the output for each prompt may be assessed in a binary fashion (e.g., harmful/non-harmful or does not leak/leaks sensitive information), and the aggregation of binary scores is used to evaluate the LLM. However, existing approaches to evaluation often neglect statistical uncertainty quantification. With an applied statistics audience in mind, we provide background on LLM text generation and evaluation, and then describe a Bayesian approach for quantifying uncertainty in binary evaluation metrics. We focus in particular on uncertainty that is induced by the probabilistic text generation…
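To make the idea of uncertainty in a binary evaluation metric concrete, here is a minimal sketch of a conjugate Beta-Binomial posterior for a refusal rate. It is an illustration under stated assumptions, not the authors' method: it treats each prompt's binary score as an independent draw with one shared rate, whereas the abstract says the paper focuses on uncertainty induced by probabilistic text generation. The function name `beta_posterior_summary`, the uniform Beta(1, 1) prior, and the counts in the usage lines are all hypothetical.

```python
import random


def beta_posterior_summary(successes, trials, prior_a=1.0, prior_b=1.0,
                           level=0.95, draws=20000, seed=0):
    """Summarize the Beta posterior of a binary rate.

    Assumes independent binary outcomes with a shared rate and a
    Beta(prior_a, prior_b) prior (illustrative choices, not from the paper).
    Returns (posterior_mean, interval_low, interval_high).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    # Conjugate update: Beta(a, b) prior plus binomial counts.
    a = prior_a + successes
    b = prior_b + (trials - successes)
    mean = a / (a + b)

    # Approximate an equal-tailed credible interval by Monte Carlo,
    # using only the standard library.
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo_idx = int((1 - level) / 2 * draws)
    hi_idx = int((1 + level) / 2 * draws) - 1
    return mean, samples[lo_idx], samples[hi_idx]


# Hypothetical counts: 47 refusals observed on 50 harmful prompts.
mean, low, high = beta_posterior_summary(successes=47, trials=50)
print(f"posterior mean refusal rate: {mean:.3f}")
print(f"95% credible interval: [{low:.3f}, {high:.3f}]")
```

Reporting the interval alongside the point estimate shows how much a small benchmark can and cannot say about the underlying rate. The Monte Carlo interval is approximate and depends on the number of draws.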

Taxonomy

Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques