Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling
Mingqian Feng, Xiaodong Liu, Weiwei Yang, Chenliang Xu, Christopher White, Jianfeng Gao

TL;DR
This paper introduces SABER, a scalable statistical method to accurately predict large-scale adversarial attack success rates on LLMs, revealing risks underestimated by standard evaluations.
Contribution
We develop a Beta distribution-based scaling law for predicting adversarial risk in LLMs under Best-of-N sampling, enabling reliable extrapolation from small samples.
Findings
SABER predicts attack success rates with 86.2% lower error than the baseline.
Models appearing robust may have nonlinear risk amplification under parallel attacks.
Heterogeneous risk scaling profiles are observed across models.
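The last two findings can be illustrated with a minimal sketch. It assumes each prompt has a per-sample success probability p drawn from a Beta(alpha, beta) distribution and that the N samples are independent, in which case the expected Best-of-N success rate is 1 - B(alpha, beta + N) / B(alpha, beta). The function name `asr_at_n` and both parameter profiles are hypothetical and chosen only for illustration; they are not values or code from the paper.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def asr_at_n(alpha: float, beta: float, n: int) -> float:
    """Expected probability of at least one success in n independent samples,
    averaged over p ~ Beta(alpha, beta).

    Uses the identity E[(1 - p)^n] = B(alpha, beta + n) / B(alpha, beta).
    """
    log_fail = log_beta(alpha, beta + n) - log_beta(alpha, beta)
    return -math.expm1(log_fail)  # 1 - exp(log_fail), stable when log_fail is near 0


# Two hypothetical profiles with the same single-shot rate, ASR@1 = 0.01.
profiles = {
    "spread_out": (1.0, 99.0),         # every prompt is slightly vulnerable
    "mass_near_zero": (0.01, 0.99),    # most prompts are almost never broken
}

for name, (a, b) in profiles.items():
    rates = [round(asr_at_n(a, b, n), 4) for n in (1, 10, 100, 1000)]
    print(name, rates)
```

Both profiles look equally robust at N = 1, yet under these assumptions the first approaches saturation at N = 1000 while the second stays low, which is why a single-shot number alone cannot rank models by large-budget risk.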
Abstract
Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting, which underestimates real-world risk. In practice, attackers can exploit large-scale parallel sampling to repeatedly probe a model until a harmful response is produced. While recent work shows that attack success increases with repeated sampling, principled methods for predicting large-scale adversarial risk remain limited. We propose a scaling-aware Best-of-N estimation of risk, SABER, for modeling jailbreak vulnerability under Best-of-N sampling. We model sample-level success probabilities using a Beta distribution, the conjugate prior of the Bernoulli distribution, and derive an analytic scaling law that enables reliable extrapolation of large-N attack success rates from small-budget measurements. Using only n=100 samples, our anchored estimator predicts ASR@1000…
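The Beta model in the abstract leads to a closed form that can be worked out directly. The derivation below is a sketch under stated assumptions: each prompt has a fixed per-sample success probability p, the N samples are independent, and an attack counts as successful if any sample succeeds. The symbols alpha and beta are the Beta shape parameters. This is the standard Beta-Bernoulli identity; the paper's exact scaling law and its anchored estimator are not given in this excerpt and may differ.

```latex
% p ~ Beta(alpha, beta): per-sample success probability of a prompt
\begin{align}
\mathrm{ASR}@N
  &= 1 - \mathbb{E}_{p \sim \mathrm{Beta}(\alpha,\beta)}\big[(1-p)^N\big] \\
  &= 1 - \frac{1}{B(\alpha,\beta)} \int_0^1 p^{\alpha-1}(1-p)^{\beta+N-1}\,dp \\
  &= 1 - \frac{B(\alpha,\beta+N)}{B(\alpha,\beta)}
   = 1 - \frac{\Gamma(\alpha+\beta)\,\Gamma(\beta+N)}{\Gamma(\beta)\,\Gamma(\alpha+\beta+N)}, \\
1 - \mathrm{ASR}@N
  &\sim \frac{\Gamma(\alpha+\beta)}{\Gamma(\beta)}\, N^{-\alpha}
  \quad (N \to \infty).
\end{align}
```

The last line shows that, under this model, the failure probability decays as a power law in N with exponent alpha, so the shape of the distribution near p = 0 governs how quickly risk grows with budget.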
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Network Security and Intrusion Detection
