A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection
Ivan Zhang

TL;DR
This paper presents a real-time, self-tuning moderation framework that effectively defends large language models against adversarial jailbreaks, adapting quickly without extensive retraining.
Contribution
The authors introduce a novel, lightweight, self-tuning moderator framework that enhances LLM security against adversarial prompts in real-time.
Findings
Effective against modern jailbreaks
Outperforms traditional fine-tuning methods
Maintains low training overhead
Abstract
Ensuring LLM alignment is critical to information security as AI models become increasingly widespread and integrated in society. Unfortunately, many defenses against adversarial attacks and jailbreaking on LLMs cannot adapt quickly to new attacks, degrade model responses to benign prompts, or introduce significant barriers to scalable implementation. To mitigate these challenges, we introduce a real-time, self-tuning (RTST) moderator framework to defend against adversarial attacks while maintaining a lightweight training footprint. We empirically evaluate its effectiveness using Google's Gemini models against modern, effective jailbreaks. Our results demonstrate the advantages of an adaptive, minimally intrusive framework for jailbreak defense over traditional fine-tuning or classifier models.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing
