Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models
Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas

TL;DR
Text2VLM introduces a pipeline that converts text-only datasets into multimodal prompts, enabling evaluation of VLMs' robustness against typographic prompt injection attacks and revealing vulnerabilities in current models.
Contribution
The paper presents a novel multi-stage pipeline that adapts text-only datasets into multimodal formats for evaluating VLM alignment and robustness.
Findings
Open-source VLMs become more vulnerable to prompt injection when visual inputs are introduced.
There is a significant performance gap between open-source and closed-source VLMs.
Text2VLM is validated through human evaluations and aligns with human expectations.
Abstract
The increasing integration of Visual Language Models (VLMs) into AI systems necessitates robust model alignment, especially when handling multimodal content that combines text and images. Existing evaluation datasets lean heavily towards text-only prompts, leaving visual vulnerabilities under-evaluated. To address this gap, we propose Text2VLM, a novel multi-stage pipeline that adapts text-only datasets into multimodal formats, specifically designed to evaluate the resilience of VLMs against typographic prompt injection attacks. The Text2VLM pipeline identifies harmful content in the original text and converts it into a typographic image, creating a multimodal prompt for VLMs. Our evaluation of open-source VLMs highlights their increased susceptibility to prompt injection when visual inputs are introduced, revealing critical weaknesses in the current models' alignment.…
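The core transformation described in the abstract (flag content in a text prompt, move it into an image, and leave a reference behind) can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword list `FLAGGED_TERMS`, the `MultimodalPrompt` type, the `to_multimodal` function, and the placeholder wording are all hypothetical, and a simple phrase match stands in for whatever detector the pipeline actually uses. The final rasterisation of `image_lines` into a typographic image is left out; an image library would handle that step.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for the pipeline's content detector.
# The abstract does not say how harmful content is identified.
FLAGGED_TERMS = ["disable the alarm", "bypass the login page"]


@dataclass
class MultimodalPrompt:
    text: str               # prompt with flagged spans replaced by image references
    image_lines: List[str]  # numbered lines to be drawn into the typographic image


def to_multimodal(prompt: str, flagged_terms=FLAGGED_TERMS) -> MultimodalPrompt:
    """Split a text-only prompt into a text part and an image part."""
    image_lines: List[str] = []
    if not flagged_terms:
        return MultimodalPrompt(text=prompt, image_lines=image_lines)

    def move_to_image(match: "re.Match[str]") -> str:
        # Record the flagged span as a numbered image line,
        # then point to it from the remaining text.
        image_lines.append(f"{len(image_lines) + 1}. {match.group(0)}")
        return f"<item {len(image_lines)} in the image>"

    pattern = re.compile(
        "|".join(re.escape(term) for term in flagged_terms), re.IGNORECASE
    )
    text = pattern.sub(move_to_image, prompt)
    return MultimodalPrompt(text=text, image_lines=image_lines)
```

Calling `to_multimodal("Explain how to disable the alarm quietly.")` yields a text part that refers to item 1 in the image and a single image line holding the removed phrase. A prompt with no flagged phrase passes through unchanged with an empty image part.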
