Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance
Enyi Shi, Fei Shen, Shuyi Miao, Linxia Zhu, Pengyang Shao, Jinhui Tang, and Tat-Seng Chua

TL;DR
Precise Shield introduces a neuron-level approach to enhance VLLM safety by identifying and constraining critical safety neurons, improving robustness against multimodal and multilingual attacks while maintaining generalization.
Contribution
The paper presents a two-stage framework that first locates safety neurons and then constrains parameter updates to that subspace, enabling safety to transfer and be preserved across languages and modalities.
Findings
Safety neurons are shared across languages and modalities.
The gradient mask restricts updates to fewer than 0.03% of the model's parameters.
The method improves safety without sacrificing multilingual and multimodal performance.
Abstract
In real-world deployments, Vision-Language Large Models (VLLMs) face critical challenges from multilingual and multimodal composite attacks: harmful images paired with low-resource language texts can easily bypass defenses designed for high-resource language scenarios, exposing structural blind spots in current cross-lingual and cross-modal safety methods. This raises a mechanistic question: where is safety capability instantiated within the model, and how is it distributed across languages and modalities? Prior studies on pure-text LLMs have identified cross-lingual shared safety neurons, suggesting that safety may be governed by a small subset of critical neurons. Leveraging this insight, we propose Precise Shield, a two-stage framework that first identifies safety neurons by contrasting activation patterns between harmful and benign inputs, and then constrains parameter updates…
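The two stages described in the abstract can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the scoring rule (gap in mean absolute activation between harmful and benign inputs), the top-k selection, the row-wise mask, the plain SGD step, and all function names are assumptions made for illustration. The abstract only states that safety neurons are found by contrasting activation patterns and that parameter updates are then constrained.

```python
def mean_abs_activation(acts):
    """Per-neuron mean of |activation| over a list of activation vectors."""
    n = len(acts)
    width = len(acts[0])
    return [sum(abs(row[j]) for row in acts) / n for j in range(width)]


def select_safety_neurons(harmful_acts, benign_acts, k):
    """Stage 1 (assumed scoring rule): rank neurons by the gap in mean
    |activation| between harmful and benign inputs and keep the top k."""
    h = mean_abs_activation(harmful_acts)
    b = mean_abs_activation(benign_acts)
    scores = [abs(hj - bj) for hj, bj in zip(h, b)]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def masked_sgd_step(weights, grads, safety_neurons, lr=0.1):
    """Stage 2 (assumed form): apply an SGD step only to the weight rows of
    the selected neurons; every other row is copied unchanged."""
    keep = set(safety_neurons)
    return [
        [w - lr * g for w, g in zip(w_row, g_row)] if i in keep else list(w_row)
        for i, (w_row, g_row) in enumerate(zip(weights, grads))
    ]


def updated_fraction(weights, safety_neurons):
    """Fraction of all parameters that the mask allows to change."""
    total = sum(len(row) for row in weights)
    touched = sum(len(weights[i]) for i in safety_neurons)
    return touched / total


# Toy data: 6 hypothetical neurons, of which 1 and 4 react to harmful inputs.
harmful = [[0.1, 2.0, 0.2, 0.1, -1.5, 0.0],
           [0.2, 1.8, 0.1, 0.2, -1.7, 0.1]]
benign = [[0.1, 0.1, 0.2, 0.1, 0.1, 0.0],
          [0.2, 0.2, 0.1, 0.2, -0.1, 0.1]]

neurons = select_safety_neurons(harmful, benign, k=2)
weights = [[1.0, 1.0, 1.0] for _ in range(6)]   # one row per neuron
grads = [[1.0, 1.0, 1.0] for _ in range(6)]
new_weights = masked_sgd_step(weights, grads, neurons, lr=0.5)

print(neurons)                             # selected neuron indices
print(updated_fraction(weights, neurons))  # share of parameters updated
```

In this toy setting a third of the parameters are updated; the summary above reports that in the actual method the masked updates touch fewer than 0.03% of parameters.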
