Retention Score: Quantifying Jailbreak Risks for Vision Language Models

Zaitang Li; Pin-Yu Chen; Tsung-Yi Ho

arXiv:2412.17544·cs.AI·December 24, 2024

TL;DR

This paper introduces the Retention Score, a novel metric to quantify jailbreak risks in Vision-Language Models, demonstrating its effectiveness in assessing model robustness against adversarial attacks across multiple VLMs.

Contribution

The paper proposes the Retention Score as a certified robustness metric and evaluates the vulnerability of various VLMs, including API-based models, to jailbreak attacks.

Findings

01. Most VLMs with visual components are less robust against jailbreak attacks than plain VLMs.

02. Security settings in API-based models significantly affect robustness scores.

03. The proposed method is time-efficient and ranks model robustness consistently.

Abstract

The emergence of Vision-Language Models (VLMs) is a significant advancement in integrating computer vision with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities. However, this progress has also made VLMs vulnerable to sophisticated adversarial attacks, raising concerns about their reliability. The objective of this paper is to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs. To evaluate a VLM's ability to maintain its robustness against adversarial input perturbations, we propose a novel metric called the Retention Score. Retention Score is a multi-modal evaluation metric that includes Retention-I and Retention-T scores for quantifying jailbreak risks in visual and textual components of VLMs. Our process involves generating synthetic image-text pairs using a…

Taxonomy

Topics: Artificial Intelligence in Law · Digital and Cyber Forensics · Occupational Health and Safety Research

Methods: Diffusion