PredictaBoard: Benchmarking LLM Score Predictability

Lorenzo Pacchiardi; Konstantinos Voudouris; Ben Slater; Fernando Martínez-Plumed; José Hernández-Orallo; Lexin Zhou; Wout Schellaert

arXiv:2502.14445 · cs.CL · June 18, 2025

TL;DR

PredictaBoard introduces a benchmarking framework to evaluate how well assessors can predict LLM errors, aiming to improve the safety and reliability of large language models by focusing on their predictability.

Contribution

This paper presents a novel collaborative benchmarking framework, PredictaBoard, for evaluating the ability of score predictors to anticipate LLM errors on specific tasks.

Findings

1. Baseline assessors show varying prediction accuracy
2. PredictaBoard reveals the importance of predictability in LLM safety
3. Framework encourages development of more reliable LLM assessors

Abstract

Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success even in basic common sense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable "safe zone" is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative benchmarking framework designed to evaluate the ability of score predictors (referred to as assessors) to anticipate LLM errors on specific task instances (i.e., prompts) from existing datasets. PredictaBoard evaluates pairs of LLMs and assessors by considering the rejection rate at different error tolerances. As such, PredictaBoard stimulates research into developing better assessors and into making LLMs more predictable, not merely higher-performing on average. We conduct…
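To make the idea of a rejection rate at a given error tolerance concrete, here is a minimal sketch in Python. It is not taken from the PredictaBoard repository: the function name `min_rejection_rate`, its inputs, and the exact definition used (accept the k instances the assessor is most confident about, and report the smallest rejected fraction for which the LLM's error rate on the accepted instances stays within the tolerance) are assumptions made for illustration. The paper's own metric may differ in its details.

```python
from typing import Sequence


def min_rejection_rate(
    confidences: Sequence[float],
    correct: Sequence[bool],
    tolerance: float,
) -> float:
    """Smallest fraction of instances to reject so that the LLM's error
    rate on the accepted instances is at most `tolerance`.

    Hypothetical helper for illustration; not the PredictaBoard API.
    confidences[i] is the assessor's predicted probability that the LLM
    answers instance i correctly; correct[i] is the observed outcome.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one instance is required")

    # Rank instances from most to least confident (assumed acceptance rule:
    # keep the top-k instances by assessor confidence).
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)

    errors = 0
    best_k = 0  # largest accepted set that respects the tolerance
    for k, i in enumerate(order, start=1):
        if not correct[i]:
            errors += 1
        if errors / k <= tolerance:
            best_k = k

    return (n - best_k) / n


if __name__ == "__main__":
    # Toy data: five prompts, assessor confidences, and LLM outcomes.
    conf = [0.9, 0.8, 0.7, 0.6, 0.5]
    ok = [True, True, False, True, False]
    for tol in (0.0, 0.25, 0.4):
        print(tol, min_rejection_rate(conf, ok, tol))
```

Under this reading, a better assessor ranks the LLM's failures lower, so fewer instances need to be rejected to meet the same tolerance. The score therefore depends on the LLM and assessor jointly rather than on the LLM's average accuracy alone.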

Code & Models

Repositories

Kinds-of-Intelligence-CFI/PredictaBoard (Official)

Taxonomy

Topics: Natural Language Processing Techniques