Jailbreaker in Jail: Moving Target Defense for Large Language Models
Bocheng Chen, Advait Paliwal, Qiben Yan

TL;DR
This paper introduces a moving target defense system for large language models that significantly reduces their vulnerability to adversarial attacks while maintaining helpfulness and harmlessness.
Contribution
We propose a novel moving target defense (MTD) enhanced LLM system that improves robustness against adversarial queries by using multiple models and filtering mechanisms.
Findings
Reduces attack success rate from 37.5% to 0%
Decreases response refusal rate from 50% to 0%
Enhances safety and reliability of LLMs against adversarial attacks
Abstract
Large language models (LLMs), known for their capability in understanding and following instructions, are vulnerable to adversarial attacks. Researchers have found that current commercial LLMs either fail to be "harmless" by presenting unethical answers, or fail to be "helpful" by refusing to offer meaningful answers when faced with adversarial queries. To strike a balance between being helpful and harmless, we design a moving target defense (MTD) enhanced LLM system. The system aims to deliver non-toxic answers that align with outputs from multiple model candidates, making them more robust against adversarial attacks. We design a query and output analysis model to filter out unsafe or non-responsive answers. We evaluate 8 of the most recent chatbot models with state-of-the-art adversarial queries. Our…
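The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The keyword lists, the function names (is_refusal, is_unsafe, mtd_respond), and the uniform random choice among surviving answers are all assumptions made for illustration. The paper describes a learned query and output analysis model, which simple keyword matching only stands in for here.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical marker lists (assumption): stand-ins for the paper's
# learned query and output analysis model.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry")
UNSAFE_MARKERS = ("step-by-step exploit",)


def is_refusal(answer: str) -> bool:
    """Flag non-responsive answers (hypothetical keyword check)."""
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def is_unsafe(answer: str) -> bool:
    """Flag unsafe answers (hypothetical keyword check)."""
    text = answer.lower()
    return any(marker in text for marker in UNSAFE_MARKERS)


def mtd_respond(
    query: str,
    models: Sequence[Callable[[str], str]],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Query every candidate model, drop unsafe or refusing answers,
    and return one surviving answer chosen at random.

    Returns None when no candidate passes both filters.
    """
    rng = rng or random.Random()
    candidates = [model(query) for model in models]
    acceptable = [
        answer
        for answer in candidates
        if not is_unsafe(answer) and not is_refusal(answer)
    ]
    if not acceptable:
        return None
    # Assumption: uniform choice, so the attacker cannot predict
    # which model's output will be served.
    return rng.choice(acceptable)
```

Each entry in models is any callable that maps a query string to an answer string, so stub functions can stand in for real chatbot clients when trying the sketch out. Returning None when every candidate is filtered leaves the fallback policy to the caller.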
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Adversarial Robustness in Machine Learning
