TL;DR
This paper extends one-step image generation from class labels to text inputs by integrating LLM-based text encoders into the MeanFlow framework, enabling richer content creation with improved performance.
Contribution
The paper introduces an approach for incorporating powerful text encoders into MeanFlow for text-conditioned image synthesis, addressing the need for highly discriminable text features in limited-step generation.
Findings
Successful adaptation of MeanFlow for text-conditioned image generation.
Significant improvements in generation quality when the approach is applied to diffusion models.
Analysis reveals the importance of high discriminability in text features for limited-step generation.
Abstract
Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared to the limited class labels, text conditions pose greater challenges to the model's understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders using conventional training strategies results in unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, due to the…
