Self-Evaluation Unlocks Any-Step Text-to-Image Generation

Xin Yu; Xiaojuan Qi; Zhengqi Li; Kai Zhang; Richard Zhang; Zhe Lin; Eli Shechtman; Tianyu Wang; Yotam Nitzan

arXiv:2512.22374·cs.CV·December 30, 2025

TL;DR

Self-E introduces a novel self-evaluating training method for text-to-image generation that supports any-step inference, achieving high quality with fewer steps and outperforming traditional models in speed and scalability.

Contribution

The paper presents the first from-scratch, any-step text-to-image model trained with self-evaluation, bridging local supervision and global matching for efficient, high-quality generation.

Findings

01. Excels in few-step generation with high quality.

02. Competitive with state-of-the-art models at 50 steps.

03. Performance improves monotonically with more inference steps.

Abstract

We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale…
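To make the "local supervision" and "any-step inference" terms in the abstract concrete, here is a minimal sketch of a standard Flow Matching training pair and a fixed-step Euler sampler. It is not the Self-E method: the self-evaluation term is not specified in this summary and is not reproduced here. The function names, the convention that noise sits at t = 0 and data at t = 1, and the toy point-mass oracle standing in for a trained network are all assumptions made for illustration.

```python
def fm_training_pair(x0, x1, t):
    """Build one Flow Matching example on the straight path from noise x0 to data x1.

    Returns the interpolated point x_t and the velocity target x1 - x0.
    This is the 'local' supervision: it only describes the path at a single time t.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]
    return x_t, target_v


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with num_steps Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_point_oracle(x1):
    """Hypothetical stand-in for a trained network.

    When the data distribution is the single point x1, the exact velocity
    field is (x1 - x) / (1 - t). The sampler above never evaluates it at t=1.
    """
    def velocity(x, t):
        return [(b - xi) / (1.0 - t) for xi, b in zip(x, x1)]
    return velocity


if __name__ == "__main__":
    noise, data = [0.5, -1.0], [2.0, 3.0]
    print(fm_training_pair(noise, data, 0.25))
    for steps in (1, 4, 50):
        print(steps, euler_sample(make_point_oracle(data), noise, steps))
```

In this toy case the path is a straight line, so the sampler reaches the target at every step count. For real image distributions a model trained only on the local target generally needs many steps, which is the gap the abstract says self-evaluation is meant to close.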

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning