Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs
Hen Davidov, Shai Feldman, Gilad Freidkin, Yaniv Romano

TL;DR
This paper proposes a survival-analysis-based method for estimating how many generations a large language model needs before it produces an unsafe response. The method outputs a calibrated lower bound on that number, with coverage guarantees, for use in safety assessment.
Contribution
The paper introduces a novel calibration technique and an optimized sampling-budget allocation scheme for constructing a lower predictive bound on time-to-unsafe-sampling in LLMs, with theoretical coverage guarantees.
Findings
The method yields a lower predictive bound on time-to-unsafe-sampling with rigorous coverage guarantees.
An optimized sampling-budget allocation scheme improves sample efficiency.
Experimental results validate theoretical guarantees.
Abstract
We introduce time-to-unsafe-sampling, a novel safety measure for generative models, defined as the number of generations required by a large language model (LLM) to trigger an unsafe (e.g., toxic) response. While providing a new dimension for prompt-adaptive safety evaluation, quantifying time-to-unsafe-sampling is challenging: unsafe outputs are often rare in well-aligned models and thus may not be observed under any feasible sampling budget. To address this challenge, we frame this estimation problem as one of survival analysis. We build on recent developments in conformal prediction and propose a novel calibration technique to construct a lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt with rigorous coverage guarantees. Our key technical innovation is an optimized sampling-budget allocation scheme that improves sample efficiency while maintaining…
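The censoring problem described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the paper's method. It assumes that, for a fixed prompt, each generation is unsafe independently with a fixed probability, so that time-to-unsafe-sampling is geometrically distributed. The names `sample_time_to_unsafe`, `survival_probability`, `p_unsafe`, and `budget` are hypothetical and chosen for illustration only.

```python
import random


def sample_time_to_unsafe(p_unsafe, budget, rng):
    """Simulate sampling generations for one prompt under a fixed budget.

    Assumption (not from the paper): each generation is unsafe
    independently with probability `p_unsafe`.
    Returns (observed_time, event_observed). If no unsafe output appears
    within `budget` generations, the observation is right-censored and
    only tells us that the true time exceeds `budget`.
    """
    for t in range(1, budget + 1):
        if rng.random() < p_unsafe:
            return t, True
    return budget, False


def survival_probability(p_unsafe, t):
    """P(T > t) under the geometric assumption: (1 - p)^t."""
    return (1.0 - p_unsafe) ** t


rng = random.Random(0)
p_unsafe, budget = 0.001, 100  # hypothetical values for a well-aligned model
results = [sample_time_to_unsafe(p_unsafe, budget, rng) for _ in range(2000)]

# Fraction of prompts for which no unsafe output was ever observed.
censored_fraction = sum(1 for _, seen in results if not seen) / len(results)
expected_censored = survival_probability(p_unsafe, budget)
print(f"censored: {censored_fraction:.3f}, geometric model: {expected_censored:.3f}")
```

With a small unsafe probability and a modest budget, most simulated observations are censored. That is the setting in which survival-analysis tools, rather than a plain average of observed times, become the natural fit.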
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Data Quality and Management
