Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety

Variath Madhupal Gautham Nair; Vishal Varma Dantuluri

arXiv:2505.04146·cs.CL·May 8, 2025

Open Access

TL;DR

This paper introduces UTCB (Unmasking the Canvas Benchmark), a dynamic benchmark dataset for evaluating how vulnerable large language models are to prompt-based jailbreaks that elicit unsafe images, and it highlights the need for stronger content safety measures.

Contribution

The paper presents a scalable, evolving benchmark dataset with structured prompt strategies and multi-tiered verification to assess and improve LLM content safety in image generation.

Findings

01. Prompt engineering can induce unsafe image generation.
02. Multilingual obfuscation challenges LLM safety defenses.
03. The benchmark supports automated and manual verification tiers.

Abstract

Existing large language models (LLMs) are advancing rapidly and produce outstanding results in image generation tasks, yet their content safety checks remain vulnerable to prompt-based jailbreaks. Through preliminary testing on platforms such as ChatGPT, MetaAI, and Grok, we observed that even short, natural prompts could lead to the generation of compromising images ranging from realistic depictions of forged documents to manipulated images of public figures. We introduce Unmasking the Canvas (UTC Benchmark; UTCB), a dynamic and scalable benchmark dataset to evaluate LLM vulnerability in image generation. Our methodology combines structured prompt engineering, multilingual obfuscation (e.g., Zulu, Gaelic, Base64), and evaluation using Groq-hosted LLaMA-3. The pipeline supports both zero-shot and fallback prompting strategies, risk scoring, and automated tagging. All generations are…
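The abstract lists Base64 among the obfuscation channels used in the benchmark. From the defender's side, one mitigation is to normalize a prompt before any content safety check sees it. The snippet below is a minimal sketch of that idea, not the authors' pipeline: the function names `try_decode_base64` and `normalize_prompt` are hypothetical, and the length threshold is an arbitrary assumption chosen for illustration.

```python
import base64
from typing import Optional


def try_decode_base64(text: str) -> Optional[str]:
    """Return decoded text if `text` looks like Base64-encoded UTF-8, else None."""
    candidate = text.strip()
    # Padded Base64 has a length that is a multiple of 4; the minimum
    # length of 8 is an assumed threshold to skip very short strings.
    if len(candidate) < 8 or len(candidate) % 4 != 0:
        return None
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        raw = base64.b64decode(candidate, validate=True)
        decoded = raw.decode("utf-8")
    except ValueError:
        # Covers binascii.Error, UnicodeDecodeError, and non-ASCII input.
        return None
    # Keep only decodings that are readable text.
    return decoded if decoded.isprintable() else None


def normalize_prompt(prompt: str) -> str:
    """Return the plaintext form of a prompt so a safety check can inspect it."""
    decoded = try_decode_base64(prompt)
    return decoded if decoded is not None else prompt
```

A downstream moderation step would then run on `normalize_prompt(prompt)` rather than on the raw input. This handles only whole-prompt Base64; the low-resource-language obfuscation mentioned in the abstract (Zulu, Gaelic) would need a separate translation or language-identification step.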

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Artificial Intelligence in Healthcare and Education