MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

Chun Yan Ryan Kan; Tommy Tran; Vedant Yadav; Ava Cai; Kevin Zhu; Ruizhe Li; Maheep Chaudhary

arXiv:2602.18782·cs.CR·February 24, 2026

Open Access

TL;DR

MANATEE is a novel inference-time defense for large language models that uses density estimation and diffusion to detect and mitigate adversarial jailbreak attacks without retraining or modifying the model architecture.

Contribution

The paper introduces a diffusion-based method that projects anomalous representations toward safe regions, avoiding the need for harmful training data or model modifications.

Findings

01. Reduces attack success rate by up to 100% on certain datasets.

02. Preserves model utility on benign inputs.

03. Works across multiple LLMs, including Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it.

Abstract

Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial input falls outside the learned decision boundary, and repeated fine-tuning is computationally expensive while potentially degrading model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions, requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduces Attack Success Rate by up to 100% on certain datasets, while preserving model utility on benign inputs.
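The idea in the abstract (estimate the density of benign hidden states, then move anomalous states along the score toward high-density regions) can be illustrated with a toy sketch. This is not the authors' implementation: a closed-form Gaussian score stands in for the learned score function, noise-free score ascent stands in for the diffusion step, and every name here (`fit_gaussian`, `score`, `mahalanobis_sq`, `project`, the step size, the threshold) is a hypothetical choice made for illustration.

```python
import numpy as np

def fit_gaussian(benign):
    """Fit a mean and regularized precision matrix to benign hidden states.

    Assumption: a single Gaussian replaces the learned density model.
    """
    mu = benign.mean(axis=0)
    cov = np.cov(benign, rowvar=False) + 1e-3 * np.eye(benign.shape[1])
    return mu, np.linalg.inv(cov)

def score(x, mu, prec):
    """Score of the Gaussian: grad_x log p(x) = -prec @ (x - mu)."""
    return -(x - mu) @ prec

def mahalanobis_sq(x, mu, prec):
    """Squared Mahalanobis distance, used here as the anomaly measure."""
    d = x - mu
    return np.einsum("...i,ij,...j->...", d, prec, d)

def project(x, mu, prec, threshold, step=0.1, max_steps=200):
    """Follow the score until x lies inside the benign region.

    Inputs already within the threshold are returned unchanged, which is
    how this sketch leaves benign representations alone.
    """
    x = np.array(x, dtype=float)
    for _ in range(max_steps):
        if mahalanobis_sq(x, mu, prec) <= threshold:
            break
        x = x + step * score(x, mu, prec)  # move toward higher benign density
    return x
```

In this sketch the threshold would be set from benign data alone, for example as a high quantile of `mahalanobis_sq` over the benign set, which mirrors the abstract's point that no harmful examples are needed to define the safe region.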

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Explainable Artificial Intelligence (XAI)