CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing
Zixia Wang, Gaojie Jin, Jia Hu, Ronghui Mu

TL;DR
CluCERT introduces a clustering-guided denoising framework that improves the certification of LLM robustness against adversarial prompts by providing tighter bounds and reducing computational costs.
Contribution
The paper presents a novel clustering-based denoising approach that enhances robustness certification of LLMs with theoretical validation and efficiency improvements.
Findings
Yields tighter certified robustness bounds than existing methods
Reduces computational cost relative to existing methods
Effective across various downstream tasks and jailbreak scenarios
Abstract
Recent advancements in Large Language Models (LLMs) have led to their widespread adoption in daily applications. Despite their impressive capabilities, they remain vulnerable to adversarial attacks: even minor meaning-preserving changes, such as synonym substitutions, can lead to incorrect predictions. Certifying the robustness of LLMs against such adversarial prompts is therefore of vital importance. Existing approaches have focused on word deletion or simple denoising strategies to achieve robustness certification. However, these methods face two critical limitations: (1) they yield loose robustness bounds due to the lack of semantic validation for perturbed outputs, and (2) they suffer from high computational costs due to repeated sampling. To address these limitations, we propose CluCERT, a novel framework for certifying LLM robustness via clustering-guided denoising smoothing.…
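To make the "repeated sampling" cost in the abstract concrete, the following is a minimal sketch of generic sampling-based smoothing under synonym substitution. It is not CluCERT's algorithm and contains no clustering or denoising step. The synonym table, `toy_classifier` (a stand-in for an LLM call), and all function and parameter names are hypothetical and chosen for illustration only. The sketch takes a majority vote over randomly perturbed copies of the input and reports a one-sided Hoeffding lower bound on the probability of the winning label.

```python
import math
import random
from collections import Counter

# Hypothetical synonym table; a real system would use a lexical resource.
SYNONYMS = {
    "good": ["great", "fine"],
    "bad": ["poor", "awful"],
    "movie": ["film"],
}

def perturb(words, rate, rng):
    """Swap each word that has synonyms for a random synonym with probability `rate`."""
    return [
        rng.choice(SYNONYMS[w]) if w in SYNONYMS and rng.random() < rate else w
        for w in words
    ]

def toy_classifier(words):
    """Stand-in for an LLM call (assumption): counts sentiment words."""
    positive = {"good", "great", "fine"}
    negative = {"bad", "poor", "awful"}
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return "positive" if score >= 0 else "negative"

def smoothed_predict(text, n_samples=200, rate=0.5, alpha=0.05, seed=0):
    """Majority vote over perturbed inputs plus a lower confidence bound.

    Returns (label, p_lower), where p_lower is a one-sided Hoeffding bound
    that holds with probability at least 1 - alpha.
    """
    rng = random.Random(seed)
    words = text.split()
    # Each sample costs one model call; this is the repeated-sampling expense.
    votes = Counter(
        toy_classifier(perturb(words, rate, rng)) for _ in range(n_samples)
    )
    label, count = votes.most_common(1)[0]
    p_hat = count / n_samples
    p_lower = max(0.0, p_hat - math.sqrt(math.log(1 / alpha) / (2 * n_samples)))
    return label, p_lower

label, p_lower = smoothed_predict("a good movie")
print(label, round(p_lower, 3))
```

The lower bound is only the statistical ingredient of a certificate. Turning it into a guarantee against a specific perturbation set requires an analysis of the smoothing distribution that is not shown here.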
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
