TL;DR
This paper introduces an importance sampling method that estimates the probability of harmful language model outputs with far fewer samples than brute-force sampling, making tail risks measurable in safety evaluations.
Contribution
It presents a novel importance sampling approach that samples from unsafe versions of the target model to estimate the probability of rare harmful outputs, making safety assessment more sample-efficient.
Findings
Estimates match brute-force Monte Carlo results with 10-20x fewer samples.
Can estimate probabilities as low as 10^-4 with only 500 samples.
Reveals model sensitivity to input perturbations and deployment risks.
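To see why a probability near 10^-4 is hard to reach by brute force with 500 samples, the following sketch works through the standard binomial arithmetic for a naive Monte Carlo estimator. It is not taken from the paper; the function names `prob_zero_hits` and `mc_relative_std` are hypothetical and chosen only for illustration.

```python
import math


def prob_zero_hits(p: float, n: int) -> float:
    """Probability that n independent samples contain no harmful output,
    when each sample is harmful with probability p."""
    return (1.0 - p) ** n


def mc_relative_std(p: float, n: int) -> float:
    """Relative standard deviation of the naive Monte Carlo estimate
    (hits / n) of p: sqrt(p * (1 - p) / n) divided by p."""
    return math.sqrt((1.0 - p) / (n * p))


# Values quoted in the findings above: p = 10^-4 and 500 samples.
p, n = 1e-4, 500
expected_hits = p * n            # average number of harmful samples seen
p_no_hits = prob_zero_hits(p, n)  # chance the naive estimate is exactly 0
rel_std = mc_relative_std(p, n)   # noise relative to the true value
```

With these numbers, naive sampling expects 0.05 harmful samples, returns an estimate of exactly zero about 95% of the time, and has a relative standard deviation above 4. This is the gap that a better-chosen sampling distribution is meant to close.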
Abstract
Language models are increasingly capable and are being rapidly deployed on a population-level scale. As a result, the safety of these models is increasingly high-stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst-case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute-force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe…
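The importance sampling idea in the abstract can be sketched on a toy problem. The example below is a minimal illustration under stated assumptions, not the paper's implementation: the two dictionaries stand in for the target model and an "unsafe" proposal model, each with only two possible outputs, and the names `importance_sampling_estimate`, `P_TARGET`, and `Q_UNSAFE` are hypothetical. With real language models, the ratio `p[y] / q[y]` would presumably be a ratio of sequence likelihoods under the two models; how the paper constructs its unsafe models is not shown in this excerpt.

```python
import random


def importance_sampling_estimate(p, q, is_harmful, n, rng):
    """Estimate the probability of a harmful output under p using n samples
    drawn from the proposal q.

    p, q: dicts mapping output -> probability (toy stand-ins, assumed here).
    is_harmful: predicate on outputs.
    """
    outputs = list(q)
    proposal_weights = [q[y] for y in outputs]
    total = 0.0
    for y in rng.choices(outputs, weights=proposal_weights, k=n):
        if is_harmful(y):
            # The likelihood ratio reweights a sample from q back to p.
            total += p[y] / q[y]
    return total / n


# Toy distributions: the harmful output is rare under the target model
# and common under the unsafe proposal.
P_TARGET = {"safe": 1 - 1e-4, "harmful": 1e-4}
Q_UNSAFE = {"safe": 0.5, "harmful": 0.5}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
estimate = importance_sampling_estimate(
    P_TARGET, Q_UNSAFE, lambda y: y == "harmful", n=500, rng=rng
)
```

Because roughly half of the 500 proposal samples are harmful, the estimate averages many small weights instead of waiting for a rare event, and it lands close to the true value of 10^-4. The estimator stays unbiased as long as `q` assigns nonzero probability to every harmful output that `p` can produce.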
