Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM

Adarsh Kumarappan; Ayushi Mehrotra

arXiv:2511.18721·cs.LG·March 10, 2026

TL;DR

This paper introduces a probabilistic certification framework for SmoothLLM. By modeling attack success with empirical data, it gives safety guarantees against jailbreaking attacks that are more realistic, and therefore more trustworthy, than the original certificate.

Contribution

The paper proposes the (k, ε)-unstable framework, a data-informed, probabilistic safety certificate for LLM defenses against diverse jailbreaking attacks. It relaxes the strict k-unstable assumption that the original SmoothLLM certificate depends on.

Findings

01 Derived a new lower bound on defense probability using empirical attack models

02 Introduced the (k, ε)-unstable framework for practical safety guarantees

03 Enhanced trustworthiness of LLM safety certification in real-world scenarios

Abstract

The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict "k-unstable" assumption that rarely holds in practice. This strong assumption can limit the trustworthiness of the provided safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic framework, "(k, ε)-unstable," to certify defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). We derive a new, data-informed lower bound on SmoothLLM's defense probability by incorporating empirical models of attack success, providing a more trustworthy and practical safety certificate. By introducing the notion of (k, ε)-unstable, our framework provides practitioners with actionable safety guarantees, enabling them to set certification thresholds that better reflect the real-world…
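To make the idea of a probabilistic lower bound concrete, here is a minimal Python sketch under a deliberately simplified model. It is not the bound derived in the paper. The assumptions, all hypothetical, are: the defense queries `n_copies` independently perturbed prompts and succeeds when a strict majority are defended; `alpha` is the probability that a perturbation changes at least k characters of the adversarial content; and when that happens the attack still succeeds with probability at most `eps`. Copies with fewer than k changed characters are pessimistically counted as jailbroken.

```python
from math import comb


def per_copy_defense_prob(alpha: float, eps: float) -> float:
    """Lower bound on the chance that one perturbed copy is defended.

    Assumption (illustrative only): with probability alpha the perturbation
    changes at least k characters, and in that case the attack succeeds with
    probability at most eps. All other copies are counted as failures.
    """
    return alpha * (1.0 - eps)


def majority_vote_lower_bound(n_copies: int, p_defended: float) -> float:
    """Probability that a strict majority of independent copies are defended."""
    threshold = n_copies // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(n_copies, t) * p_defended**t * (1.0 - p_defended) ** (n_copies - t)
        for t in range(threshold, n_copies + 1)
    )


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show how eps loosens the bound.
    for eps in (0.0, 0.05, 0.2):
        p = per_copy_defense_prob(alpha=0.9, eps=eps)
        print(f"eps={eps:.2f}  bound={majority_vote_lower_bound(10, p):.4f}")
```

Setting `eps = 0` recovers the strict k-unstable case in this toy model, and increasing `eps` can only lower the bound, which is the qualitative trade-off a practitioner would weigh when choosing a certification threshold.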

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Access Control and Trust