E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources
Tong Shen, Jingai Yu, Dong Zhou, Dong Li, Emad Barsoum

TL;DR
E-MMDiT is a lightweight multimodal diffusion transformer with 304M parameters for fast image synthesis. It needs few training resources and is built around token reduction, using new compression and attention techniques.
Contribution
The paper introduces E-MMDiT, a compact diffusion model with innovative token compression and attention methods, enabling high-quality image generation with low resource requirements.
Findings
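Achieves a GenEval score of 0.66 for 512px generation after training on 25M public data samples for 1.5 days on a single node of 8 AMD MI300X GPUs, and reaches 0.72 with post-training techniques such as GRPO.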
Uses novel multi-path compression and Position Reinforcement for efficiency.
Demonstrates competitive image synthesis quality with significantly reduced resources.
Abstract
Diffusion models have shown strong capabilities in generating high-quality images from text prompts. However, these models often require large-scale training data and significant computational resources to train, or rely on heavy architectures with high latency. To this end, we propose Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis requiring low training resources. We provide an easily reproducible baseline with competitive results. Our model for 512px generation, trained with only 25M public data in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and reaches 0.72 with post-training techniques such as GRPO. Our design philosophy centers on token reduction, as the computational cost scales significantly with the token count. We adopt a…
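The abstract's point that cost grows quickly with token count can be made concrete with a rough estimate. The sketch below is illustrative only: the downsampling factors, patch size, and hidden width are hypothetical values chosen for the example, not figures from the paper, and the FLOP formula counts only the two matrix products in standard self-attention.

```python
def num_tokens(image_size: int, downsample: int, patch_size: int = 1) -> int:
    """Tokens for a square image after latent downsampling and patchification.

    `downsample` and `patch_size` are hypothetical settings for illustration.
    """
    side = image_size // (downsample * patch_size)
    return side * side


def attention_flops(n_tokens: int, dim: int) -> int:
    """Approximate FLOPs of one self-attention layer's two matmuls.

    Q @ K^T costs about 2 * n^2 * d and attn @ V another 2 * n^2 * d,
    so the total grows quadratically with the token count.
    """
    return 4 * n_tokens * n_tokens * dim


# Assumed setup A: 8x latent downsampling followed by 2x2 patches.
tokens_a = num_tokens(512, downsample=8, patch_size=2)
# Assumed setup B: a more aggressive 32x downsampling, no extra patching.
tokens_b = num_tokens(512, downsample=32, patch_size=1)

dim = 64  # hypothetical per-head width
ratio = attention_flops(tokens_a, dim) / attention_flops(tokens_b, dim)
print(tokens_a, tokens_b, ratio)
```

Under these assumed settings, cutting the token count by 4x reduces the attention matmul cost by 16x, which is why a design focused on token reduction can pay off disproportionately.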
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Computer Graphics and Visualization Techniques
