Setting-Matched and Semantics-Scaled Benchmarking of One-Step Generative Models Against Multistep Diffusion and Flow Models

Advaith Ravishankar; Serena Liu; Mingyang Wang; Todd Zhou; Jeffrey Zhou; Arnav Sharma; Ziling Hu; Léopold Das; Abdulaziz Sobirov; Faizaan Siddique; Freddy Yu; Seungjoo Baek; Yan Luo; Mengyu Wang

arXiv:2603.14186·cs.CV·May 11, 2026

TL;DR

This paper benchmarks one-step generative models against multi-step diffusion and flow models on ImageNet, revealing tradeoffs between image quality and alignment as well as the effects of guidance, and introduces new evaluation metrics for semantic fidelity.

Contribution

The paper provides a comprehensive, standardized comparison of one-step and multi-step models across multiple datasets and introduces scaled variants of FID and Inception Score (csFID, psFID, csIS, psIS) for better semantic assessment.

Findings

01

One-step models can match multi-step models on certain metrics but face tradeoffs in alignment and human preference.

02

Guidance techniques can improve FID but may harm semantic alignment and visual quality.

03

The new metrics (csFID, psFID, csIS, and psIS) help diagnose semantic fidelity in image generation.
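This page does not say how the scaled metrics are defined. For orientation only, the sketch below computes the standard Fréchet distance that FID is based on, from two sets of feature vectors. It is not the paper's csFID or psFID. The function name `frechet_distance` is hypothetical, and the feature extractor (Inception-v3 in standard FID) is assumed to have been run already.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Standard Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (n_samples, n_features).
    This is plain FID on precomputed features, not the scaled variants.
    """
    feats_a = np.asarray(feats_a, dtype=float)
    feats_b = np.asarray(feats_b, dtype=float)

    # Fit a Gaussian (mean, covariance) to each feature set.
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))

    # Tr((cov_a cov_b)^{1/2}) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (clipped here against round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Lower values mean the two feature distributions are closer. Identical feature sets give a distance of approximately zero, up to floating-point round-off.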

Abstract

State-of-the-art text-to-image models produce high-quality images, but inference remains expensive as generation requires several sequential ODE or denoising steps. Native one-step models aim to reduce this cost by mapping noise to an image in a single step, yet fair comparisons to multi-step systems are difficult because studies use mismatched sampling steps and different classifier-free guidance (CFG) settings, where CFG can shift FID, Inception Score, and CLIP-based alignment in opposing directions. It is also unclear how well one-step models scale to multi-step inference, and there is limited standardized out-of-distribution evaluation for label-ID-conditioned generators beyond ImageNet. To address this, we benchmark eight models spanning one-step flows (MeanFlow, Improved MeanFlow, SoFlow), multi-step baselines (RAE, Scale-RAE), and established systems (SiT, Stable Diffusion 3.5,…
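As a point of reference for the CFG settings mentioned in the abstract, here is a minimal sketch of the usual classifier-free guidance combination step. It assumes the convention in which a scale of 1.0 reproduces the conditional prediction; some codebases use a different parameterization. The function name `cfg_combine` is hypothetical, and the sketch is not taken from any of the benchmarked models.

```python
import numpy as np

def cfg_combine(pred_uncond, pred_cond, scale):
    """Combine unconditional and conditional model predictions.

    Assumed convention: scale = 0 gives the unconditional prediction,
    scale = 1 gives the conditional prediction, and scale > 1
    extrapolates past the conditional prediction (stronger guidance).
    """
    pred_uncond = np.asarray(pred_uncond, dtype=float)
    pred_cond = np.asarray(pred_cond, dtype=float)
    return pred_uncond + scale * (pred_cond - pred_uncond)
```

Because the guidance scale changes every sampled image, results reported at different scales are not directly comparable, which is the mismatch the abstract describes.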

Code & Models

Datasets

harvardairobotics/reLAIONet
dataset · 271 downloads
