Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM
Adarsh Kumarappan, Ayushi Mehrotra

TL;DR
This paper introduces a probabilistic certification framework for SmoothLLM that models attack success with empirical data, yielding more realistic safety guarantees against jailbreaking attacks.
Contribution
It proposes the (k, ε)-unstable framework, providing a data-informed, probabilistic safety certificate for LLM defenses against diverse attacks, addressing limitations of previous strict assumptions.
Findings
Derived a new lower bound on defense probability using empirical attack models
Introduced the (k, ε)-unstable framework for practical safety guarantees
Enhanced trustworthiness of LLM safety certification in real-world scenarios
Abstract
The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict "k-unstable" assumption that rarely holds in practice. This strong assumption can limit the trustworthiness of the provided safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic framework, "(k, ε)-unstable," to certify defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). We derive a new, data-informed lower bound on SmoothLLM's defense probability by incorporating empirical models of attack success, providing a more trustworthy and practical safety certificate. By introducing the notion of (k, ε)-unstable, our framework provides practitioners with actionable safety guarantees, enabling them to set certification thresholds that better reflect the real-world…
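The abstract excerpt does not state the bound itself, so the following is only a minimal sketch of how a probabilistic relaxation could enter a majority-vote certificate. It is not the paper's derivation. It assumes that (k, ε)-unstable means an attack still succeeds with probability at most ε once at least k characters are perturbed, that the perturbed copies are independent, and that the defense succeeds when a strict majority of copies are defended. The names `p_at_least_k`, `eps`, and `n_copies` are hypothetical.

```python
from math import comb


def majority_vote_probability(alpha: float, n_copies: int) -> float:
    """Probability that a strict majority of n_copies independent copies
    are defended, when each copy is defended with probability alpha."""
    threshold = n_copies // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(n_copies, t) * alpha**t * (1.0 - alpha) ** (n_copies - t)
        for t in range(threshold, n_copies + 1)
    )


def relaxed_per_copy_probability(p_at_least_k: float, eps: float) -> float:
    """Assumed per-copy lower bound: the copy has at least k perturbed
    characters (probability p_at_least_k) AND the attack then fails
    (probability at least 1 - eps under the assumed definition)."""
    return p_at_least_k * (1.0 - eps)


def defense_lower_bound(p_at_least_k: float, eps: float, n_copies: int) -> float:
    """Illustrative lower bound on the defense probability. The majority-vote
    probability is non-decreasing in alpha, so a lower bound on alpha
    carries through to the vote."""
    alpha = relaxed_per_copy_probability(p_at_least_k, eps)
    return majority_vote_probability(alpha, n_copies)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the effect of eps.
    for eps in (0.0, 0.05, 0.2):
        bound = defense_lower_bound(p_at_least_k=0.9, eps=eps, n_copies=9)
        print(f"eps={eps:.2f} -> bound={bound:.4f}")
```

Setting `eps=0` recovers the strict case under these assumptions; raising `eps` lowers the per-copy probability and therefore the certificate, which is the trade-off a practitioner would weigh when choosing a threshold.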
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Access Control and Trust
