Retention Score: Quantifying Jailbreak Risks for Vision Language Models
Zaitang Li, Pin-Yu Chen, Tsung-Yi Ho

TL;DR
This paper introduces the Retention Score, a novel metric to quantify jailbreak risks in Vision-Language Models, demonstrating its effectiveness in assessing model robustness against adversarial attacks across multiple VLMs.
Contribution
The paper proposes the Retention Score as a certified robustness metric and uses it to evaluate the vulnerability of various VLMs, including API-based models, to jailbreak attacks.
Findings
Most VLMs with visual components are less robust than plain VLMs.
Security settings in API models significantly impact robustness scores.
The proposed method is time-efficient and consistent in ranking model robustness.
Abstract
The emergence of Vision-Language Models (VLMs) is a significant advancement in integrating computer vision with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities. However, this progress has also made VLMs vulnerable to sophisticated adversarial attacks, raising concerns about their reliability. The objective of this paper is to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs. To evaluate a VLM's ability to maintain its robustness against adversarial input perturbations, we propose a novel metric called the Retention Score. Retention Score is a multi-modal evaluation metric that includes Retention-I and Retention-T scores for quantifying jailbreak risks in visual and textual components of VLMs. Our process involves generating synthetic image-text pairs using a…
Taxonomy
Topics: Artificial Intelligence in Law · Digital and Cyber Forensics · Occupational Health and Safety Research
Methods: Diffusion
