A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection

Ivan Zhang

arXiv:2508.07139·cs.CR·August 12, 2025

Open Access

TL;DR

This paper presents a real-time, self-tuning moderation framework that effectively defends large language models against adversarial jailbreaks, adapting quickly without extensive retraining.

Contribution

The author introduces a lightweight, self-tuning moderator framework that defends LLMs against adversarial prompts in real time.

Findings

01. Effective against modern jailbreaks

02. Outperforms traditional fine-tuning methods

03. Maintains low training overhead

Abstract

Ensuring LLM alignment is critical to information security as AI models become increasingly widespread and integrated in society. Unfortunately, many defenses against adversarial attacks and jailbreaking on LLMs cannot adapt quickly to new attacks, degrade model responses to benign prompts, or introduce significant barriers to scalable implementation. To mitigate these challenges, we introduce a real-time, self-tuning (RTST) moderator framework to defend against adversarial attacks while maintaining a lightweight training footprint. We empirically evaluate its effectiveness using Google's Gemini models against modern, effective jailbreaks. Our results demonstrate the advantages of an adaptive, minimally intrusive framework for jailbreak defense over traditional fine-tuning or classifier models.

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing