Internal Activation Revision: Safeguarding Vision Language Models Without Parameter Update
Qing Li, Jiahui Geng, Zongxiong Chen, Kun Song, Lei Ma, and Fakhri Karray

TL;DR
This paper introduces an internal activation revision method that improves the safety of vision-language models by adjusting internal activations during generation, reducing harmful outputs without updating any model parameters.
Contribution
The proposed activation revision approach is a novel technique that improves VLM safety by revising internal activations at multiple levels without parameter updates.
Findings
Reduces attack success rates by up to 52.98% across benchmarks.
Significantly improves safety with minimal impact on helpfulness.
Offers flexible revision strategies at layer and head levels.
Abstract
Vision-language models (VLMs) demonstrate strong multimodal capabilities but have been found to be more susceptible to generating harmful content than their backbone large language models (LLMs). Our investigation reveals that the integration of images significantly shifts the model's internal activations during the forward pass, diverging from those triggered by textual input. Moreover, the safety alignments of LLMs embedded within VLMs are not sufficiently robust to handle the activation discrepancies, making the models vulnerable to even the simplest jailbreaking attacks. To address this issue, we propose an internal activation revision approach that efficiently revises activations during generation, steering the model toward safer outputs. Our framework incorporates revisions at both the layer and head levels, offering control over the model's generation at varying…
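The layer-level and head-level revisions described in the abstract can be illustrated with a toy sketch. The abstract does not say how the revision direction is obtained, so this example assumes a common steering recipe (the difference between mean activations on safe and unsafe examples). All function names, the `alpha` strength, and the toy numbers are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def revision_vector(safe_acts: Sequence[Vector],
                    unsafe_acts: Sequence[Vector]) -> Vector:
    """Assumed steering direction: mean(safe) - mean(unsafe)."""
    safe_mean = mean_vector(safe_acts)
    unsafe_mean = mean_vector(unsafe_acts)
    return [s - u for s, u in zip(safe_mean, unsafe_mean)]


def revise_layer(hidden: Vector, direction: Vector,
                 alpha: float = 1.0) -> Vector:
    """Layer-level revision: shift the whole hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def revise_heads(hidden: Vector, direction: Vector, head_dim: int,
                 heads: Sequence[int], alpha: float = 1.0) -> Vector:
    """Head-level revision: shift only the slices of the selected heads."""
    out = list(hidden)
    for head in heads:
        start = head * head_dim
        for i in range(start, start + head_dim):
            out[i] += alpha * direction[i]
    return out


# Toy activations: hidden size 4, i.e. two heads of dimension 2.
safe = [[1.0, 0.0, 2.0, 2.0], [3.0, 0.0, 2.0, 4.0]]
unsafe = [[0.0, 1.0, 2.0, 1.0], [0.0, 1.0, 2.0, 1.0]]
direction = revision_vector(safe, unsafe)

hidden = [0.5, 0.5, 0.5, 0.5]
layer_out = revise_layer(hidden, direction, alpha=0.5)
head_out = revise_heads(hidden, direction, head_dim=2, heads=[1], alpha=0.5)
```

In a real VLM the same arithmetic would be applied to activations inside the forward pass at generation time, with no weights changed. Revising only selected heads touches a smaller part of the hidden state than a full layer shift, which is one way to read the finer-grained control the abstract mentions.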
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing
