OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models

Ziheng Cheng; Yixiao Huang; Hui Xu; Somayeh Sojoudi; Xuandong Zhao; Dawn Song; Song Mei

arXiv:2505.21347 · cs.LG · October 28, 2025

TL;DR

This paper introduces OVERT, a large-scale benchmark for evaluating over-refusal in text-to-image (T2I) models. It reveals widespread overly cautious behavior that reduces model utility and explores prompt rewriting as a mitigation strategy.

Contribution

The paper presents the first systematic benchmark for over-refusal in T2I models, including a large dataset and evaluation framework for safety-utility trade-offs.

Findings

01. Over-refusal is prevalent across various safety categories.
02. Prompt rewriting often reduces faithfulness to original prompts.
03. The framework can generate customized safety evaluation data.

Abstract

Text-to-Image (T2I) models have achieved remarkable success in generating visual content from text inputs. Although multiple safety alignment strategies have been proposed to prevent harmful outputs, they often lead to overly cautious behavior -- rejecting even benign prompts -- a phenomenon known as over-refusal that reduces the practical utility of T2I models. Despite over-refusal having been observed in practice, there is no large-scale benchmark that systematically evaluates this phenomenon for T2I models. In this paper, we present an automatic workflow to construct synthetic evaluation data, resulting in OVERT (OVEr-Refusal evaluation on Text-to-image models), the first large-scale benchmark for assessing over-refusal behaviors in T2I models. OVERT includes 4,600 seemingly harmful but benign prompts across nine safety-related categories,…
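
To make the notion of over-refusal concrete, the following is a minimal sketch of how a per-category refusal rate on benign prompts could be tallied. It is an illustration only, not the paper's evaluation code: the function name `refusal_rate_by_category`, the `(category, refused)` record layout, and the placeholder category names are all assumptions, and the benchmark's actual metric and refusal detection may differ.

```python
from collections import defaultdict


def refusal_rate_by_category(records):
    """Return the fraction of benign prompts refused, per category.

    `records` is an iterable of (category, refused) pairs, where `refused`
    is True if the model declined a benign prompt. This layout is a
    hypothetical one chosen for illustration.
    """
    totals = defaultdict(int)
    refused_counts = defaultdict(int)
    for category, was_refused in records:
        totals[category] += 1
        if was_refused:
            refused_counts[category] += 1
    return {c: refused_counts[c] / totals[c] for c in totals}


# Toy data with placeholder category names (not the benchmark's categories).
records = [
    ("category_a", True),
    ("category_a", False),
    ("category_b", False),
    ("category_b", False),
]
rates = refusal_rate_by_category(records)
```

Under this sketch, a higher rate for a category means the model turns away more benign prompts there, which is the behavior the benchmark is designed to surface.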

Code & Models

Repositories

yixiao-huang/OVERT (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning