ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time

Yi Ding; Bolian Li; Ruqi Zhang

arXiv:2410.06625 · cs.CV · February 11, 2025 · 2 cites

TL;DR

This paper introduces ETA, a two-phase inference-time framework for vision-language model safety: it first evaluates visual inputs and generated responses, then aligns the responses judged unsafe, significantly reducing unsafe outputs while improving helpfulness.

Contribution

The paper presents a novel inference-time safety alignment method for VLMs that evaluates inputs and responses before aligning unsafe behaviors, improving both safety and usefulness without the substantial data and compute that resource-intensive defenses require.

Findings

01 Reduces the unsafe rate by 87.5% under cross-modality attacks

02 Achieves a 96.6% win-tie rate in GPT-4 helpfulness evaluation

03 Outperforms baseline safety methods in experiments

Abstract

Vision Language Models (VLMs) have become essential backbones for multimodal intelligence, yet significant safety challenges limit their real-world application. While textual inputs are often effectively safeguarded, adversarial visual inputs can easily bypass VLM defense mechanisms. Existing defense methods are either resource-intensive, requiring substantial data and compute, or fail to simultaneously ensure safety and usefulness in responses. To address these limitations, we propose a novel two-phase inference-time alignment framework, Evaluating Then Aligning (ETA): 1) Evaluating input visual contents and output responses to establish a robust safety awareness in multimodal settings, and 2) Aligning unsafe behaviors at both shallow and deep levels by conditioning the VLMs' generative distribution with an interference prefix and performing sentence-level best-of-N to search the most…
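The two phases described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `propose`, `input_unsafe`, `output_unsafe`, and `score` are hypothetical callables standing in for the VLM and its safety evaluators, the prefix string is a placeholder, and triggering alignment only when both checks flag the sample is an assumption that the abstract does not spell out.

```python
from typing import Callable, List

# Hypothetical interface: propose(image, prompt, partial, n) returns up to n
# candidate next sentences for the partial response, or [] when finished.
Propose = Callable[[object, str, str, int], List[str]]


def _extend(partial: str, sentence: str) -> str:
    """Append a sentence to a partial response with single-space joining."""
    return f"{partial} {sentence}".strip()


def eta_generate(
    image: object,
    prompt: str,
    propose: Propose,
    input_unsafe: Callable[[object], bool],       # hypothetical visual-input check
    output_unsafe: Callable[[str, str], bool],    # hypothetical response check
    score: Callable[[str, str], float],           # hypothetical safety/usefulness score
    prefix: str = "As an AI assistant,",          # placeholder prefix text
    n: int = 4,
    max_sentences: int = 8,
) -> str:
    # Ordinary decoding: take the single proposed sentence at each step.
    draft = ""
    for _ in range(max_sentences):
        candidates = propose(image, prompt, draft, 1)
        if not candidates:
            break
        draft = _extend(draft, candidates[0])

    # Phase 1 (evaluate): inspect the visual input and the drafted response.
    # Assumption: align only when both evaluators flag the sample.
    if not (input_unsafe(image) and output_unsafe(prompt, draft)):
        return draft

    # Phase 2 (align), shallow level: condition generation on a prefix.
    aligned = prefix
    # Phase 2 (align), deep level: sentence-level best-of-N search.
    for _ in range(max_sentences):
        candidates = propose(image, prompt, aligned, n)
        if not candidates:
            break
        best = max(candidates, key=lambda s: score(prompt, _extend(aligned, s)))
        aligned = _extend(aligned, best)
    return aligned
```

Passing the model and evaluators in as callables keeps the control flow visible: samples that pass evaluation are returned unchanged, and only flagged samples pay the cost of regenerating with N candidates per sentence.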

Code & Models

Repositories

dripnowhy/eta · PyTorch · Official

Videos

ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Attention Is All You Need · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings