Estimating Tail Risks in Language Model Output Distributions

Rico Angell; Raghav Singhal; Zachary Horvitz; Zhou Yu; Rajesh Ranganath; Kathleen McKeown; He He

arXiv:2604.22167 · cs.LG · April 27, 2026

TL;DR

This paper introduces an importance sampling method that estimates the probability of a language model producing a harmful output for a given input, measuring tail risk with far fewer samples than brute-force sampling.

Contribution

The paper operationalizes importance sampling by creating unsafe versions of the target model, so that the probability of rare harmful outputs can be estimated accurately and safety assessment becomes more sample-efficient.

Findings

01 Estimates match brute-force Monte Carlo results while using 10 to 20 times fewer samples.

02 Probabilities as low as 10^-4 can be estimated with only 500 samples.

03 The estimates reveal how sensitive models are to input perturbations and what risks this implies at deployment.
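
To put the second finding in context, the following worked calculation shows why naive sampling struggles at this scale. The notation is illustrative and not taken from the paper: p(y | x) stands for the target model, q(y | x) for a proposal distribution (for example, an unsafe version of the model), h(y) for a 0/1 harmfulness indicator, and ρ(x) for the harmful-output probability. The importance sampling identity assumes q(y | x) > 0 wherever h(y) p(y | x) > 0.

```latex
% Illustrative notation only; symbols are assumptions, not the paper's.
\begin{align*}
& \rho(x) = \mathbb{E}_{y \sim p(\cdot \mid x)}\big[h(y)\big]
          = \mathbb{E}_{y \sim q(\cdot \mid x)}\!\left[h(y)\,\frac{p(y \mid x)}{q(y \mid x)}\right] \\[4pt]
& \hat{\rho}_{\mathrm{MC}} = \frac{1}{n}\sum_{i=1}^{n} h(y_i), \quad y_i \sim p(\cdot \mid x),
  \qquad
  \frac{\operatorname{sd}\big(\hat{\rho}_{\mathrm{MC}}\big)}{\rho} = \sqrt{\frac{1-\rho}{n\rho}} \\[4pt]
& \rho = 10^{-4},\ n = 500: \qquad n\rho = 0.05, \qquad
  \sqrt{\frac{1-\rho}{n\rho}} \approx 4.47, \qquad
  (1-\rho)^{n} \approx e^{-0.05} \approx 0.951
\end{align*}
```

Under these assumptions, 500 draws from the target model contain no harmful output about 95% of the time, and the naive estimate has a relative standard error of roughly 447%. A proposal that makes harmful outputs more likely, combined with the reweighting in the first line, is the standard way to reduce that variance.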

Abstract

Language models are increasingly capable and are being rapidly deployed on a population-level scale. As a result, the safety of these models is increasingly high-stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst-case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute-force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe…
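
The general idea of importance sampling described in the abstract can be sketched on a toy discrete distribution. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-outcome distributions, the function names, and the harmfulness check are all hypothetical stand-ins for a target model, an unsafe proposal model, and a safety classifier.

```python
import random

# Hypothetical toy setup: three possible outputs, index 2 is "harmful".
# TARGET mimics an aligned model where the harmful output is rare (1e-4).
# PROPOSAL mimics an "unsafe" model that produces it much more often.
TARGET = [0.9, 0.0999, 0.0001]
PROPOSAL = [0.2, 0.3, 0.5]


def is_harmful(outcome: int) -> bool:
    """Stand-in for a harmfulness judge (assumption for illustration)."""
    return outcome == 2


def naive_estimate(target, harmful, n_samples, seed=0):
    """Brute-force Monte Carlo: sample from the target, count harmful draws."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(target)), weights=target, k=n_samples)
    return sum(1 for y in draws if harmful(y)) / n_samples


def importance_sampling_estimate(target, proposal, harmful, n_samples, seed=0):
    """Sample from the proposal and reweight each harmful draw by p(y) / q(y)."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(proposal)), weights=proposal, k=n_samples)
    total = 0.0
    for y in draws:
        if harmful(y):
            total += target[y] / proposal[y]  # importance weight
    return total / n_samples


naive = naive_estimate(TARGET, is_harmful, 500)
weighted = importance_sampling_estimate(TARGET, PROPOSAL, is_harmful, 500)
print(f"true probability:      {TARGET[2]:.2e}")
print(f"naive Monte Carlo:     {naive:.2e}")
print(f"importance sampling:   {weighted:.2e}")
```

In this toy case the proposal draws the harmful outcome about half the time, so the reweighted average concentrates near the true value with 500 samples, while the naive estimate is usually exactly zero. With real language models the weight would be a ratio of sequence probabilities under the two models, which this sketch does not attempt to reproduce.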

Code & Models

Repositories

rangell/LMTailRisk (GitHub)