TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows
Zhenglin Cheng, Peng Sun, Jianguo Li, Tao Lin

TL;DR
TwinFlow is a training framework for one-step generative models. It lets large multi-modal models produce high-quality outputs in a single function evaluation (1-NFE), sharply reducing inference cost.
Contribution
The paper presents TwinFlow, a novel training method for 1-step generative models that eliminates the need for fixed teachers and adversarial training, enabling efficient large-scale model inference.
Findings
Achieves 0.83 GenEval score with 1-NFE on text-to-image tasks.
Matches 100-NFE model performance with just 1-NFE, reducing computation by 100x.
Outperforms existing baselines like SANA-Sprint and RCGM in efficiency and quality.
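The NFE figures above can be made concrete with a toy sampler. The sketch below is not TwinFlow or any model from the paper: it assumes a hypothetical one-dimensional flow whose data distribution is a single point (`DATA_POINT`), so the exact velocity is known in closed form, and it counts how many times the velocity function is called. Because this toy flow has perfectly straight trajectories, one Euler step lands on the same sample as 100 steps, which is the idealized situation that a 1-NFE model tries to approximate.

```python
# Toy illustration of NFE counting; all names here are hypothetical.
DATA_POINT = 2.0  # assumed point-mass "data distribution"


def toy_velocity(x, t):
    """Exact velocity for the linear path x_t = (1 - t) * data + t * noise."""
    return (x - DATA_POINT) / t


def euler_sample(velocity, noise, num_steps):
    """Integrate from t=1 (noise) to t=0 (data); return (sample, nfe)."""
    x = noise
    nfe = 0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt          # current time, never reaches 0 inside the loop
        x = x - dt * velocity(x, t)
        nfe += 1                  # one network call per step in a real model
    return x, nfe


x_one, nfe_one = euler_sample(toy_velocity, noise=-0.7, num_steps=1)
x_many, nfe_many = euler_sample(toy_velocity, noise=-0.7, num_steps=100)
print(nfe_one, nfe_many, nfe_many // nfe_one)
```

In a real sampler each velocity call is a full forward pass of the network, so the ratio of NFE counts is a reasonable first-order estimate of the ratio of inference cost.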
Abstract
Recent advances in large multi-modal generative models have demonstrated impressive capabilities in multi-modal generation, including image and video generation. These models are typically built upon multi-step frameworks like diffusion and flow matching, which inherently limits their inference efficiency (requiring 40-100 Number of Function Evaluations (NFEs)). While various few-step methods aim to accelerate the inference, existing solutions have clear limitations. Prominent distillation-based methods, such as progressive and consistency distillation, either require an iterative distillation procedure or show significant degradation at very few steps (< 4-NFE). Meanwhile, integrating adversarial training into distillation (e.g., DMD/DMD2 and SANA-Sprint) to enhance performance introduces training instability, added complexity, and high GPU memory overhead due to the auxiliary trained…
Peer Reviews
Decision: ICLR 2026 Poster
- Successfully applying the method to 20B parameter models demonstrates promising scalability.
- The 1-NFE performance closely matching 100-NFE baselines is impressive.
- Theoretical flaws. If I understand correctly, the paper trains a model with a single-timestep condition to handle generation at arbitrary steps. **However, such a model is no longer a score model, and using it in reverse KL to compute the gradient of $\log p$ is theoretically flawed**.
- BLIP-3o-60K contains samples carefully generated using GenEval prompts, and training on BLIP-3o could yield very high GenEval scores. GenEval score is the primary metric used in this paper and highlighted in the a
- While many distillation or few-step approaches rely on additional trained or teacher networks, this method does not require any separate network.
- While strongly leveraging ideas from the DMD (Distribution Matching Distillation) paper, the authors found an elegant way to incorporate an adversarial network component into the existing flow matching / diffusion frameworks by extending the time domain to [-1,1], where [-1,0] corresponds to the fake data domain.
- The pres
- Intuition and explanation: While the intuition of an adversarial component in DMD is clear, certain design choices had to be made to make this method work: a) representing fake data by negative time steps; b) having the adversarial loss learn the negative trajectory; c) a rectification loss that should straighten the trajectory. Further, during training N=2 is adopted, meaning second-order approximations are used in the RCGM framework, giving a strong few-step prior. The paper does not
- I like that only one model is trained and that it achieves good performance on T2I models without a GAN. Compared to SiD and VSD, it combines the pretrained teacher and the true score net while making the fake score net online, which is a good improvement.
- The performance is strong, especially the scalability to large-scale (20B) models.
- This method is similar to existing distillation methods like SiD, DMD, and VSD. For example, DMD requires a trained diffusion teacher and an online-trained generator, corresponding to the $t\in [0,1]$ part of TwinFlow; the auxiliary fake score network corresponds to the $t\in[-1,0]$ part. The authors should provide a detailed comparison in their paper.
- Since this approach integrates the functions of the three models in DMD into one model, there is potential performance degradation since the total
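The signed-time convention the reviewers describe (a single model, with $t\in[0,1]$ for real data and $t\in[-1,0]$ for generator-produced "fake" data) can be pictured with a small data-routing sketch. Everything below is an assumption made for illustration: the function names are hypothetical, timesteps are drawn uniformly, and the losses applied on each half of the time domain are omitted because the excerpts above do not specify them.

```python
import random


def time_domain(t):
    """Name the half of the extended time axis that a signed timestep falls in."""
    return "fake" if t < 0 else "real"


def twin_time_batch(real_samples, fake_samples, rng):
    """Pair each sample with a signed timestep for a single shared network.

    Assumption: real data gets t in [0, 1]; generator outputs get t in [-1, 0].
    """
    batch = []
    for x in real_samples:
        batch.append((x, rng.uniform(0.0, 1.0)))
    for x in fake_samples:
        batch.append((x, -rng.uniform(0.0, 1.0)))
    return batch


batch = twin_time_batch([0.1, 0.2], [0.3], random.Random(0))
for x, t in batch:
    print(x, round(t, 3), time_domain(t))
```

The point of the sketch is only that one set of weights sees both kinds of input, distinguished by the sign of the time condition, instead of a separate auxiliary network handling the fake samples.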
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
