ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time
Yi Ding, Bolian Li, Ruqi Zhang

TL;DR
This paper introduces ETA, a two-phase inference-time framework for improving the safety of vision-language models by evaluating and aligning their outputs, significantly reducing unsafe responses and enhancing helpfulness.
Contribution
The paper presents a novel inference-time safety alignment method for VLMs that evaluates inputs and responses before aligning unsafe behaviors, improving safety and usefulness without extensive resources.
Findings
Reduces unsafe rate by 87.5% in cross-modality attacks
Achieves 96.6% win-ties in GPT-4 helpfulness evaluation
Outperforms baseline safety methods in experiments
Abstract
Vision Language Models (VLMs) have become essential backbones for multimodal intelligence, yet significant safety challenges limit their real-world application. While textual inputs are often effectively safeguarded, adversarial visual inputs can easily bypass VLM defense mechanisms. Existing defense methods are either resource-intensive, requiring substantial data and compute, or fail to simultaneously ensure safety and usefulness in responses. To address these limitations, we propose a novel two-phase inference-time alignment framework, Evaluating Then Aligning (ETA): 1) Evaluating input visual contents and output responses to establish a robust safety awareness in multimodal settings, and 2) Aligning unsafe behaviors at both shallow and deep levels by conditioning the VLMs' generative distribution with an interference prefix and performing sentence-level best-of-N to search the most…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsMultimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
MethodsDense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Attention Is All You Need · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
