From static to adaptive: immune memory-based jailbreak detection for large language models
Jun Leng, Yu Liu, Litian Zhang, Ruihan Hu, Zhuting Fang, Xi Zhang

TL;DR
This paper introduces IMAG, an immune memory-inspired adaptive framework for detecting and mitigating jailbreak attacks on large language models, improving robustness and adaptability over static methods.
Contribution
The paper proposes a novel immune memory-based framework that enables LLMs to adaptively detect and respond to evolving jailbreak attacks, surpassing static detection methods.
Findings
Achieves 94% average detection accuracy across diverse attacks
Outperforms state-of-the-art static detection baselines
Demonstrates effective adaptive defense in multiple LLMs
Abstract
Large Language Models (LLMs) serve as the backbone of modern AI systems, yet they remain susceptible to adversarial jailbreak attacks. Consequently, robust detection of such malicious inputs is paramount for ensuring model safety. Traditional detection methods typically rely on external models trained on fixed, large-scale datasets, which often incur significant computational overhead. Recent methods instead leverage the internal safety signals of the models themselves to enable more lightweight and efficient detection. However, these methods remain inherently static and struggle to adapt to the evolving nature of jailbreak attacks. Drawing inspiration from the biological immune mechanism, we introduce the Immune Memory Adaptive Guard (IMAG) framework. By distilling and encoding safety patterns into a persistent, evolvable memory bank, IMAG enables adaptive generalization to emerging threats.…
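The abstract does not specify how the memory bank is built or queried, so the following is only a toy illustration of the general idea of an evolvable memory of attack patterns, not the IMAG method. It assumes each prompt has already been mapped to a feature vector; the names `MemoryBank`, `is_known_attack`, and `remember`, the cosine-similarity match, and the 0.9 threshold are all hypothetical choices made for this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryBank:
    """Hypothetical store of feature vectors for previously seen attacks."""

    def __init__(self, threshold=0.9):
        # Assumed similarity cutoff; not a value taken from the paper.
        self.threshold = threshold
        self.patterns = []

    def is_known_attack(self, vec):
        # Flag the input if it is close to any remembered pattern.
        return any(cosine(vec, p) >= self.threshold for p in self.patterns)

    def remember(self, vec):
        # "Evolve" the bank: store a confirmed attack unless a
        # near-duplicate pattern is already present.
        if not self.is_known_attack(vec):
            self.patterns.append(list(vec))


bank = MemoryBank(threshold=0.9)
bank.remember([1.0, 0.0, 0.0])   # a confirmed attack pattern
near_variant = [0.95, 0.1, 0.0]  # small perturbation of that pattern
unrelated = [0.0, 1.0, 0.0]      # a pattern not seen so far
```

In this sketch, `bank.is_known_attack(near_variant)` is true while `bank.is_known_attack(unrelated)` is false until `bank.remember(unrelated)` is called. That contrast is the difference between a static detector and one whose memory can grow as new attacks are confirmed.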
Taxonomy
Topics: Artificial Immune Systems Applications · Adversarial Robustness in Machine Learning · Vaccines and Immunoinformatics Approaches
