Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

Hao Wang; Yiqun Sun; Pengfei Wei; Lawrence B. Hsieh; Daisuke Kawahara

arXiv:2605.07447 · cs.CV · May 11, 2026


TL;DR

This paper introduces SAEgis, a lightweight plug-and-play framework using sparse autoencoders to detect adversarial attacks in vision-language models, enhancing safety without extra adversarial training.

Contribution

It presents the first application of sparse autoencoders as a plug-and-play method for adversarial attack detection in VLMs, improving cross-domain generalization and robustness.

Findings

01. SAEgis achieves strong detection performance across various attack settings.
02. Combining multiple layer signals enhances robustness and stability.
03. The method requires no additional adversarial training and adds minimal overhead.
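The second finding, combining signals from several layers, can be illustrated with a simple late-fusion rule. The summary does not say how SAEgis actually merges layers, so the sketch below is only one plausible reading: it assumes each layer already yields a probability-like score in [0, 1], averages those scores with equal weight, and thresholds the result. The function names, the layer names, and the 0.5 threshold are all hypothetical.

```python
import numpy as np


def fuse_layer_scores(scores_by_layer):
    """Average per-layer detector scores (assumed to lie in [0, 1]).

    Equal-weight averaging is an assumption made for illustration,
    not the fusion rule reported for SAEgis.
    """
    scores = np.array(list(scores_by_layer.values()), dtype=float)
    return float(scores.mean())


def is_adversarial(scores_by_layer, threshold=0.5):
    """Flag the input when the fused score reaches the (hypothetical) threshold."""
    return fuse_layer_scores(scores_by_layer) >= threshold


# Hypothetical scores from three layers of a VLM for one image.
example_scores = {"layer_8": 0.2, "layer_16": 0.9, "layer_24": 0.7}
fused = fuse_layer_scores(example_scores)
flagged = is_adversarial(example_scores)
```

Averaging means a single noisy layer cannot decide the outcome alone, which is one intuitive reason a multi-layer signal might be more stable than a single-layer one.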

Abstract

Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, leaving downstream applications exposed to significant risks. In this work, we propose a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs), termed SAEgis. By inserting an SAE module into a pretrained VLM and training it with standard reconstruction objectives, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples. Extensive experiments show that…
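The pipeline the abstract describes (an SAE attached to a hidden layer, trained on reconstruction, with its sparse latents fed to a classifier) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the ReLU encoder, the L1 sparsity penalty, mean-pooling over tokens, and the logistic probe are common SAE choices assumed here, and every function name and dimension is hypothetical.

```python
import numpy as np


def init_sae(d_model, d_latent, seed=0):
    """Randomly initialise a one-hidden-layer sparse autoencoder."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    return {
        "W_enc": rng.normal(0.0, scale, size=(d_model, d_latent)),
        "b_enc": np.zeros(d_latent),
        "W_dec": rng.normal(0.0, scale, size=(d_latent, d_model)),
        "b_dec": np.zeros(d_model),
    }


def sae_forward(params, h):
    """Encode activations h of shape (n_tokens, d_model) and reconstruct them."""
    pre = (h - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    z = np.maximum(0.0, pre)  # ReLU keeps latents non-negative
    h_hat = z @ params["W_dec"] + params["b_dec"]
    return z, h_hat


def sae_loss(h, h_hat, z, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity term (the L1 term is an assumption)."""
    recon = np.mean(np.sum((h - h_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(z), axis=-1))
    return recon + l1_coeff * sparsity


def image_features(z):
    """Mean-pool token-level latents into one vector per image (assumed pooling)."""
    return z.mean(axis=0)


def probe_score(features, w, b):
    """Logistic probe: estimated probability that the image was perturbed."""
    logit = features @ w + b
    return 1.0 / (1.0 + np.exp(-logit))
```

In this sketch the VLM itself stays frozen: only the SAE parameters are fitted on activations, and the probe weights `w` and `b` are learned separately on labeled clean and perturbed examples.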


Code & Models

Repositories

conan1024hao/SAEgis (GitHub)
