Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

Dian Xie; Shitong Shao; Lichen Bai; Zikai Zhou; Bojun Cheng; Shuo Yang; Jun Wu; Zeke Xie

arXiv:2602.22570·cs.CV·February 27, 2026

Open Access · 3 Reviews

TL;DR

This paper critically examines how diffusion guidance methods are evaluated in text-to-image generation. It shows that current human preference metrics are biased toward large guidance scales, proposes a guidance-aware evaluation framework, and demonstrates that simply increasing the CFG scale can match or outperform many guidance techniques.

Contribution

It identifies an evaluation bias toward large guidance scales in human preference models, introduces a guidance-aware evaluation (GA-Eval) framework, and shows that simply increasing the CFG scale can rival or surpass existing guidance methods.

Findings

01. Large guidance scales bias human preference models.

02. Increasing CFG scale improves evaluation scores but damages image quality.

03. Most guidance methods are outperformed by simply increasing CFG scales.

Abstract

Classifier-free guidance (CFG) has helped diffusion models achieve great conditional generation in various fields. Recently, more diffusion guidance methods have emerged with improved generation quality and human preference. However, can these emerging diffusion guidance methods really achieve solid and significant improvements? In this paper, we rethink recent progress on diffusion guidance. Our work mainly consists of four contributions. First, we reveal a critical evaluation pitfall that common human preference models exhibit a strong bias towards large guidance scales. Simply increasing the CFG scale can easily improve quantitative evaluation scores due to strong semantic alignment, even if image quality is severely damaged (e.g., oversaturation and artifacts). Second, we introduce a novel guidance-aware evaluation (GA-Eval) framework that employs effective guidance scale…
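As a rough illustration of what "increasing the CFG scale" means, the sketch below applies the standard classifier-free guidance combination to toy noise predictions. The second function, `projected_scale`, is a hypothetical helper introduced only for this example: it recovers a scale by least-squares projection onto the CFG direction. It is an assumption for illustration and is not the paper's definition of the effective guidance scale, which the truncated abstract does not spell out.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard CFG: start at the unconditional prediction and move
    `scale` times along the (conditional - unconditional) direction."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def projected_scale(eps_guided, eps_uncond, eps_cond):
    """Hypothetical helper (an assumption, not the paper's method):
    the least-squares w for which eps_uncond + w * (eps_cond - eps_uncond)
    best matches eps_guided."""
    direction = [c - u for u, c in zip(eps_uncond, eps_cond)]
    offset = [g - u for g, u in zip(eps_guided, eps_uncond)]
    denom = sum(d * d for d in direction)
    return sum(o * d for o, d in zip(offset, direction)) / denom


# Toy noise predictions standing in for real model outputs.
eps_u = [0.1, -0.4, 0.3, 0.0]
eps_c = [0.5, 0.2, -0.1, 0.8]

guided = cfg_combine(eps_u, eps_c, 7.5)
recovered = projected_scale(guided, eps_u, eps_c)
```

With a scale of 0 the combination returns the unconditional prediction, with a scale of 1 it returns the conditional one, and larger scales extrapolate further along the same direction, which is the regime the abstract associates with oversaturation and artifacts.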

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The paper makes an insightful observation that commonly used human preference models, such as HPSv2 and ImageReward, exhibit a strong bias toward images generated with larger CFG scales. This bias frequently leads to higher preference scores for oversaturated or artifact-prone images, revealing a critical weakness in current evaluation protocols for diffusion guidance methods.

Weaknesses

1. Writing clarity: The overall writing could be improved to enhance readability and make the key ideas easier to follow. Some sections, particularly the methodological descriptions, are difficult to interpret without additional context or intuitive explanations.
2. Equations (5)–(7): The derivations and computation steps for Equations (5) through (7) are not clearly explained. It would be helpful if the authors explicitly detailed how these equations are obtained, what intermediate steps are o

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. This paper emphasizes a significant evaluation flaw in which standard human preference models show a strong bias toward larger guidance scales. This is crucial for the field as it encourages a reevaluation of the contributions of each classifier-free guidance (CFG) method.
2. The proposed guidance-aware evaluation (GA-Eval) framework is both reasonable and robust. It will assist researchers in accurately assessing diffusion guidance methods.
3. The introduced Transcendent Diffusion Guidance (

Weaknesses

1. While GA-Eval is mathematically sound, it requires user studies to demonstrate its alignment with human assessments.
2. This paper identifies an evaluation flaw and proposes a guidance-aware evaluation framework. Another approach to addressing this issue is to enhance the reward model and testing benchmarks. Recent reward models, like HPSv3, may rely on vision-language models, and recent benchmarks, such as OneIG, also utilize VLMs to evaluate generated images. These new evaluation tools coul

Reviewer 03 · Rating 6 · Confidence 2

Strengths

The paper is highly original in both problem framing and methodology. While prior works have proposed new guidance strategies, this is the first to systematically diagnose and quantify a systematic bias in human-preference metrics tied to CFG scale. The technical execution is rigorous. The derivation of the effective guidance scale is mathematically sound and generalizable across sampling algorithms (including those that modify latents rather than noise directly). Experiments are extensive: mult

Weaknesses

1. The authors could provide more details on how HPS v2, ImageReward, etc., are biased toward high-saturation/high-alignment images.
2. While GA-Eval is a diagnostic tool, the paper offers limited direction on how to design guidance methods that truly improve generation beyond CFG scaling.
3. Some recent text-to-image works might need examination in this paper [1,2]. It would be interesting to investigate whether these recent generative reward models are still biased. [1] Unified Multimodal Chain-of-Thought

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Topic Modeling