Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
Jinqi Luo, Jinyu Yang, Tal Neiman, Lei Fan, Bing Yin, Son Tran, Mubarak Shah, René Vidal

TL;DR
This paper introduces DACO (Dictionary-Aligned Concept Control), a framework that uses a large concept dictionary and a sparse autoencoder to steer the activations of multimodal large language models, giving fine-grained control over safety-related concepts.
Contribution
DACO leverages a curated 15,000-concept dictionary and sparse coding to enable precise safety-related activation steering in MLLMs, improving safety without sacrificing general performance.
Findings
DACO improves safety on benchmarks such as MM-SafetyBench and JailBreakV.
It maintains the general capabilities of MLLMs while enhancing safety.
Experiments on models like QwenVL, LLaVA, and InternVL validate DACO's effectiveness.
Abstract
Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the activations of a frozen model at inference time has recently emerged as a flexible and effective solution. However, existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that utilizes a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM…
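The general idea of dictionary-based activation steering mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a small dictionary of unit-norm concept directions, uses a generic greedy sparse coder (orthogonal matching pursuit) as a stand-in, and does not model the Sparse Autoencoder at all. The names `sparse_code`, `steer`, `unsafe_idx`, and `strength` are hypothetical.

```python
import numpy as np

def sparse_code(h, D, n_nonzero=3):
    """Greedy orthogonal matching pursuit (a stand-in sparse coder).

    Approximates the activation h (shape [d]) as coeffs @ D, where D has one
    concept direction per row (shape [k, d]) and at most n_nonzero
    coefficients are non-zero.
    """
    residual = h.astype(float).copy()
    support = []
    sol = np.zeros(0)
    for _ in range(n_nonzero):
        scores = np.abs(D @ residual)
        if support:
            scores[support] = -np.inf  # never pick the same concept twice
        support.append(int(np.argmax(scores)))
        # Re-fit the coefficients of all selected concepts by least squares.
        sol, *_ = np.linalg.lstsq(D[support].T, h, rcond=None)
        residual = h - D[support].T @ sol
    coeffs = np.zeros(D.shape[0])
    coeffs[support] = sol
    return coeffs

def steer(h, D, unsafe_idx, n_nonzero=3, strength=1.0):
    """Remove the contribution of the concepts listed in unsafe_idx from h.

    unsafe_idx and strength are hypothetical parameters for illustration.
    """
    coeffs = sparse_code(h, D, n_nonzero)
    unsafe_part = coeffs[unsafe_idx] @ D[unsafe_idx]
    return h - strength * unsafe_part

# Toy example: four orthonormal "concepts"; concept 2 is treated as unsafe.
D = np.eye(4)
h = np.array([2.0, 0.0, 3.0, 0.0])
steered = steer(h, D, unsafe_idx=[2], n_nonzero=2)
print(steered)
```

In this toy setting the activation is an exact combination of concepts 0 and 2, so steering removes only the component along concept 2 and leaves the rest of the activation unchanged. That is the sense in which a sparse decomposition allows adjusting one concept without affecting others.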
