Risk Aware Benchmarking of Large Language Models

Apoorva Nitsure, Youssef Mroueh, Mattia Rigotti, Kristjan Greenewald, Brian Belgodere, Mikhail Yurochkin, Jiri Navratil, Igor Melnyk, and Jerret Ross

arXiv:2310.07132·cs.LG·June 11, 2024

Open Access

TL;DR

This paper introduces a statistical framework for benchmarking large language models by assessing socio-technical risks using stochastic dominance tests, enabling risk-aware model selection with quantifiable significance.

Contribution

The paper develops a distributional benchmarking method based on stochastic dominance and, for the first time in LLM evaluation, links risk assessment to the mean-risk models used in econometrics and mathematical finance.

Findings

01. The framework supports risk comparison of LLMs with respect to instruction drift and toxicity.

02. It identifies statistically significant differences among the models compared.

03. The framework is validated through theoretical analysis and empirical experiments.

Abstract

We propose a distributional framework for benchmarking socio-technical risks of foundation models with quantified statistical significance. Our approach hinges on a new statistical relative testing based on first and second order stochastic dominance of real random variables. We show that the second order statistics in this test are linked to mean-risk models commonly used in econometrics and mathematical finance to balance risk and utility when choosing between alternatives. Using this framework, we formally develop a risk-aware approach for foundation model selection given guardrails quantified by specified metrics. Inspired by portfolio optimization and selection theory in mathematical finance, we define a metrics portfolio for each model as a means to aggregate a collection of metrics, and perform model selection based on the stochastic dominance of these portfolios. The statistical…
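The abstract's two dominance orders can be illustrated with a minimal sketch on empirical samples. It assumes higher metric values are better and uses the standard quantile formulation: X first-order dominates Y when its quantile function is at least as large at every level, and second-order dominates when the same holds for the integrated quantile function. The function names, the grid size, and the sample values are hypothetical, and the sketch is only a point-estimate check; it omits the significance quantification the abstract describes and is not the authors' implementation.

```python
import numpy as np


def dominance_violation(x, y, order=1, n_grid=200):
    """Largest amount by which x fails to dominate y on a quantile grid.

    Returns 0.0 when no violation is found on the grid.
    order=1 compares quantile functions (first order);
    order=2 compares integrated quantile functions (second order).
    """
    grid = np.linspace(0.005, 0.995, n_grid)
    qx = np.quantile(np.asarray(x, dtype=float), grid)
    qy = np.quantile(np.asarray(y, dtype=float), grid)
    if order == 2:
        # Riemann-sum approximation of the integral of Q(t) over (0, p].
        qx = np.cumsum(qx) / n_grid
        qy = np.cumsum(qy) / n_grid
    return float(np.maximum(qy - qx, 0.0).max())


def dominates(x, y, order=1, tol=0.0):
    """True when x dominates y at the given order, up to tolerance tol."""
    return dominance_violation(x, y, order=order) <= tol


# Hypothetical per-prompt scores for two models (higher is better).
model_a = np.array([4.0, 5.0, 6.5])   # tighter spread, slightly higher mean
model_b = np.array([0.0, 5.0, 10.0])  # wider spread, heavier lower tail

print("A first-order dominates B: ", dominates(model_a, model_b, order=1))
print("A second-order dominates B:", dominates(model_a, model_b, order=2))
```

In this toy example the wider-spread model wins at the top quantiles, so first-order dominance fails, while the integrated quantiles still favor the tighter model. That is the sense in which the second-order criterion trades utility against risk.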

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Risk and Portfolio Optimization · Statistical Methods and Inference