Automating Steering for Safe Multimodal Large Language Models

Lyucheng Wu; Mengru Wang; Ziwen Xu; Tri Cao; Nay Oo; Bryan Hooi; Shumin Deng

arXiv:2507.13255 · cs.CL · September 24, 2025

TL;DR

AutoSteer is an inference-time intervention framework for multimodal large language models that enhances safety by detecting and mitigating toxic outputs without requiring model fine-tuning.

Contribution

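The paper introduces a modular, adaptive safety intervention method for multimodal models. Its new components are a Safety Awareness Score (SAS) for selecting safety-relevant layers, a safety prober that estimates the likelihood of toxic output, and a Refusal Head that intervenes when risk is detected.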

Findings

01. Significantly reduces attack success rates across multiple safety benchmarks.
02. Maintains core capabilities of the underlying multimodal models.
03. Operates without fine-tuning the base models.

Abstract

Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but has also raised new safety concerns, particularly when models face adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce AutoSteer, a modular and adaptive inference-time intervention technique that requires no fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse…
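The probe-then-intervene flow described in components (2) and (3) can be illustrated with a toy sketch. This is not the authors' implementation: the logistic probe, the additive `refusal_direction`, and the `threshold` and `strength` parameters are all assumptions made for illustration, and the SAS-based layer selection from component (1) is not modeled. Hidden states are plain Python lists rather than real model activations.

```python
import math
from typing import List, Sequence


def toxicity_probability(
    hidden: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Hypothetical linear probe: sigmoid(w . h + b) over one hidden state."""
    logit = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def steer(
    hidden: Sequence[float],
    refusal_direction: Sequence[float],
    p_toxic: float,
    threshold: float = 0.5,   # assumed gate; not taken from the paper
    strength: float = 4.0,    # assumed scaling; not taken from the paper
) -> List[float]:
    """Leave the hidden state untouched when the probe sees low risk;
    otherwise shift it along an assumed refusal direction, scaled by risk."""
    if p_toxic < threshold:
        return list(hidden)
    scale = strength * p_toxic
    return [h + scale * r for h, r in zip(hidden, refusal_direction)]


if __name__ == "__main__":
    probe_w, probe_b = [1.0, -1.0], 0.0   # toy probe parameters
    refusal = [0.0, 1.0]                  # toy refusal direction

    for name, h in [("benign", [0.0, 2.0]), ("risky", [2.0, 0.0])]:
        p = toxicity_probability(h, probe_w, probe_b)
        print(name, round(p, 3), steer(h, refusal, p))
```

The point of the gate is that inputs the probe scores as low risk pass through unchanged, which is one plausible way an intervention could preserve the base model's general capabilities while still acting on risky inputs.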

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling