Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations

Ryan Wong (1); Hosea David Yu Fei Ng (1); Dhananjai Sharma (1); Glenn Jun Jie Ng (1); Kavishvaran Srinivasan (1) ((1) National University of Singapore)

arXiv:2511.18933·cs.CR·November 25, 2025

Open Access

TL;DR

This paper systematically analyzes jailbreak threats to Large Language Models and proposes three defense strategies—prompt sanitization, logit steering, and domain-specific agents—to mitigate these exploits effectively.

Contribution

It introduces a comprehensive taxonomy of jailbreak defenses and presents three novel, integrated defense methods with experimental validation on benchmark datasets.

Findings

01. Substantial reduction in attack success rate

02. Full mitigation achieved with agent-based defense

03. Trade-offs identified between safety, performance, and scalability

Abstract

Large Language Models (LLMs) remain susceptible to jailbreak exploits that bypass safety filters and induce harmful or unethical behavior. This work presents a systematic taxonomy of existing jailbreak defenses across prompt-level, model-level, and training-time interventions, followed by three proposed defense strategies. First, a Prompt-Level Defense Framework detects and neutralizes adversarial inputs through sanitization, paraphrasing, and adaptive system guarding. Second, a Logit-Based Steering Defense reinforces refusal behavior through inference-time vector steering in safety-sensitive layers. Third, a Domain-Specific Agent Defense employs the MetaGPT framework to enforce structured, role-based collaboration and domain adherence. Experiments on benchmark datasets show substantial reductions in attack success rate, achieving full mitigation under the agent-based defense. Overall,…
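
The abstract's second strategy, inference-time vector steering, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names (`steer`, `steer_layers`, `projection`), the strength parameter `alpha`, the use of plain Python lists in place of real model activations, and the idea that a single "refusal direction" is added to the hidden state at a chosen set of layers. The paper's actual steering procedure and layer selection may differ.

```python
import math


def unit(vector):
    """Return the vector scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("steering direction must be non-zero")
    return [x / norm for x in vector]


def projection(hidden, direction):
    """Scalar projection of a hidden state onto the steering direction."""
    d = unit(direction)
    return sum(h * x for h, x in zip(hidden, d))


def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha along the (normalized) direction.

    Hypothetical helper: `direction` stands in for a refusal direction
    that would be estimated elsewhere; `alpha` is an assumed strength knob.
    """
    d = unit(direction)
    if len(hidden) != len(d):
        raise ValueError("hidden state and direction must have equal length")
    return [h + alpha * x for h, x in zip(hidden, d)]


def steer_layers(hidden_by_layer, direction, alpha, layers):
    """Apply steering only at the selected layer indices; copy the rest."""
    return {
        index: steer(h, direction, alpha) if index in layers else list(h)
        for index, h in hidden_by_layer.items()
    }


if __name__ == "__main__":
    states = {0: [1.0, 2.0], 1: [0.5, -0.5]}
    steered = steer_layers(states, direction=[3.0, 4.0], alpha=2.0, layers={1})
    print(projection(states[1], [3.0, 4.0]), projection(steered[1], [3.0, 4.0]))
```

In this sketch, steering raises the projection of the hidden state onto the chosen direction by exactly `alpha` at the selected layers and leaves other layers unchanged. In a real model, the same shift would typically be applied to activations through a forward hook rather than to plain lists.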

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Information and Cyber Security