Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks

Han Wang; Gang Wang; Huan Zhang

arXiv:2411.16721 · cs.CV · May 5, 2025



TL;DR

This paper introduces ASTRA, an adaptive defense method for Vision Language Models that effectively mitigates adversarial jailbreaks by steering models away from harmful feature directions with minimal impact on benign inputs.

Contribution

ASTRA is a novel, efficient adaptive steering approach that identifies transferable harmful feature directions and dynamically removes them during inference to defend against various adversarial attacks.
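
To make "dynamically removes them during inference" concrete, here is a minimal sketch of one common form of projection-based activation steering. It is an illustration under stated assumptions, not the paper's implementation: the function name `adaptive_steer`, the strength parameter `alpha`, and the rule of subtracting only the positive projection onto the steering vector are all hypothetical choices made for this example, and the toy vectors stand in for real model activations.

```python
import math

def adaptive_steer(h, v, alpha=1.0):
    """Subtract the part of activation h that points along steering vector v.

    Assumption for illustration: the steering strength scales with the
    positive projection of h onto v, so activations that do not align with
    the harmful direction are left unchanged.
    """
    norm = math.sqrt(sum(x * x for x in v))
    v_unit = [x / norm for x in v]
    # Signed length of h along the (unit) steering direction.
    proj = sum(a * b for a, b in zip(h, v_unit))
    # Steer only when h actually points toward the harmful direction.
    coeff = max(proj, 0.0)
    return [a - alpha * coeff * b for a, b in zip(h, v_unit)]

# Toy 3-dimensional example with a hypothetical harmful direction.
harmful_direction = [1.0, 0.0, 0.0]
aligned_activation = [2.0, 1.0, 0.0]     # has a positive component along the direction
unaligned_activation = [-1.0, 1.0, 0.0]  # points away from the direction

steered_aligned = adaptive_steer(aligned_activation, harmful_direction)
steered_unaligned = adaptive_steer(unaligned_activation, harmful_direction)
```

In this sketch the aligned activation loses its component along the harmful direction, while the unaligned one passes through untouched, which is the intuition behind a defense that has little effect on benign inputs.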

Findings

01 ASTRA achieves state-of-the-art defense performance across multiple models.

02 ASTRA maintains high accuracy on benign inputs while resisting adversarial jailbreaks.

03 ASTRA demonstrates strong transferability against unseen attack types.

Abstract

Vision Language Models (VLMs) can produce unintended and harmful content when exposed to adversarial attacks, particularly because their vision capabilities create new vulnerabilities. Existing defenses, such as input preprocessing, adversarial training, and response evaluation-based methods, are often impractical for real-world deployment due to their high costs. To address this challenge, we propose ASTRA, an efficient and effective defense by adaptively steering models away from adversarial feature directions to resist VLM attacks. Our key procedures involve finding transferable steering vectors representing the direction of harmful response and applying adaptive activation steering to remove these directions at inference time. To create effective steering vectors, we randomly ablate the visual tokens from the adversarial images and identify those most strongly associated with…


Code & Models

Repositories

ASTRAL-Group/ASTRA (PyTorch, official)


Taxonomy

Topics: Digital and Cyber Forensics