Exposing Hidden Biases in Text-to-Image Models via Automated Prompt Search

Manos Plitsis; Giorgos Bouritsas; Vassilis Katsouros; Yannis Panagakis

arXiv:2512.08724·cs.LG·March 18, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Bias-Guided Prompt Search (BGPS), an automated method to identify and analyze hidden social biases in text-to-image models by generating prompts that amplify biased outputs, revealing vulnerabilities and aiding bias mitigation.

Contribution

The paper presents BGPS, a novel framework combining LLMs and attribute classifiers to automatically discover subtle biases in TTI models, improving bias detection beyond curated prompt datasets.

Findings

01

Discovered previously undocumented biases in Stable Diffusion 1.5

02

Generated interpretable prompts that reveal biases effectively

03

Enhanced bias detection compared to existing prompt optimization methods

Abstract

Text-to-image (TTI) diffusion models have achieved remarkable visual quality, yet they have been repeatedly shown to exhibit social biases across sensitive attributes such as gender, race and age. To mitigate these biases, existing approaches frequently depend on curated prompt datasets - either manually constructed or generated with large language models (LLMs) - as part of their training and/or evaluation procedures. Besides the curation cost, this reliance also risks overlooking unanticipated, less obvious prompts that trigger biased generation, even in models that have undergone debiasing. In this work, we introduce Bias-Guided Prompt Search (BGPS), a framework that automatically generates prompts that aim to maximize the presence of biases in the resulting images. BGPS comprises two components: (1) an LLM instructed to produce attribute-neutral prompts and (2) attribute classifiers acting on…
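The idea of scoring prompts by how skewed the resulting images are can be illustrated with a small sketch. This is not the BGPS procedure itself, whose details are cut off in the abstract above. It is a brute-force stand-in that assumes a fixed list of candidate prompts and a hypothetical `generate_and_classify` callable that would, in a real setup, wrap the TTI model and an attribute classifier. The bias measure used here (total variation distance from a uniform attribute distribution) and the hard-coded predictions are likewise assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def bias_score(labels: Sequence[str], classes: Sequence[str]) -> float:
    """Total variation distance between observed label frequencies and uniform.

    0.0 means the predicted attribute is perfectly balanced across `classes`;
    larger values mean the generated images skew toward some classes.
    This metric is an assumption of the sketch, not taken from the paper.
    """
    if not labels:
        raise ValueError("need at least one predicted label")
    counts = Counter(labels)
    total = len(labels)
    uniform = 1.0 / len(classes)
    return 0.5 * sum(abs(counts.get(c, 0) / total - uniform) for c in classes)


def search_most_biased_prompt(
    candidates: Sequence[str],
    generate_and_classify: Callable[[str], Sequence[str]],
    classes: Sequence[str],
) -> Tuple[float, str]:
    """Score every candidate prompt and return (score, prompt) for the most skewed one."""
    scored = [(bias_score(generate_and_classify(p), classes), p) for p in candidates]
    return max(scored, key=lambda pair: pair[0])


# Hypothetical stand-in for "generate images, then classify them".
# The prompts and predictions below are made up for this sketch.
FAKE_PREDICTIONS = {
    "a photo of a person reading": ["female", "male", "male", "female"],
    "a photo of a person fixing an engine": ["male", "male", "male", "female"],
}


def fake_generate_and_classify(prompt: str) -> Sequence[str]:
    return FAKE_PREDICTIONS[prompt]


if __name__ == "__main__":
    score, prompt = search_most_biased_prompt(
        list(FAKE_PREDICTIONS), fake_generate_and_classify, ["female", "male"]
    )
    print(f"{prompt!r} has bias score {score:.2f}")
```

Both candidate prompts are attribute-neutral in wording, yet the second yields a 3:1 split in the fabricated predictions and therefore scores higher. Replacing the stub with a real image generator and classifier would leave the scoring logic unchanged.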

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The proposed method can discover subtle and previously undocumented biases, which expands the evaluation space beyond curated datasets.
2. Compared with gradient-based methods, the proposed method can generate more natural text.

Weaknesses

1. The scope of biased attributes evaluated (gender, race) is limited by the classifiers used, though the method is generalizable.
2. Although the generated prompts look more natural than those produced by gradient-based methods, as shown in Figure 1, they are still not very natural (unlike common prompts written by humans).
3. The technical novelty is somewhat limited. There is neither strong technical insight nor theoretical analysis.

Reviewer 02 · Rating 6 · Confidence 5

Strengths

This paper tackles the problem of exposing biases in T2I models using realistic, neutral-sounding prompts, which is an important problem that helps in auditing large-scale T2I models.

Weaknesses

1. Results are reported primarily on SD-1.5. Since the method depends on UNet architectures, the evaluation should be extended to SD-2.1 and SDXL to assess robustness on newer, stronger models.
2. Given the paper’s focus on residual bias in “debiased” models, the audit should also include recent debiasing methods, e.g., ITIGen, which operates in the prompt space, to strengthen the generality of the claims.
3. The paper uses simple, clean, single-person prompts. In order to test the applicability

Reviewer 03 · Rating 2 · Confidence 4

Strengths

**1. Well-motivated problem and a practical objective** The objective of developing an automated framework to audit TTI models for fairness and safety is of high practical importance for the responsible deployment of generative AI.

**2. Generation of interpretable, human-readable prompts** A primary strength of the proposed method is its ability to generate human-readable and interpretable prompts. The paper correctly identifies the limitations of gradient-based optimization methods, which o

Weaknesses

**1. Insufficient Experimental Scope and Lack of Generalizability** The paper's claims of providing a generalizable framework are unsubstantiated due to a severely limited experimental scope. The evaluation is confined entirely to a single model, Stable Diffusion 1.5. The failure to test the method on any other diverse, modern (SDXL, Flux), or transformer-based (DALL-E 3, SD3) models means it is impossible to assess whether the framework is truly general-purpose.

**2. Use of Outdated and Unsp

Taxonomy

Topics: Ethics and Social Impacts of AI · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis