Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation
Tianyi Wei, Dongdong Chen, Yifan Zhou, Xingang Pan

TL;DR
This paper improves MMDiT-based text-to-image models by addressing subject neglect and subject mixing through test-time optimization with novel loss functions, yielding higher generation quality and success rates for prompts that contain similar subjects.
Contribution
We identify three ambiguities in MMDiT (Inter-block Ambiguity, Text Encoder Ambiguity, and Semantic Ambiguity) that cause subject neglect and mixing, and propose on-the-fly latent repair with tailored loss functions and strategies to enhance generation quality.
Findings
Significant improvement in generation quality on a challenging dataset
Higher success rates compared to existing methods
Effective mitigation of subject ambiguity issues
Abstract
Representing the cutting-edge technique of text-to-image models, the latest Multimodal Diffusion Transformer (MMDiT) largely mitigates many generation issues existing in previous models. However, we discover that it still suffers from subject neglect or mixing when the input text prompt contains multiple subjects of similar semantics or appearance. We identify three possible ambiguities within the MMDiT architecture that cause this problem: Inter-block Ambiguity, Text Encoder Ambiguity, and Semantic Ambiguity. To address these issues, we propose to repair the ambiguous latent on-the-fly by test-time optimization at early denoising steps. In detail, we design three loss functions: Block Alignment Loss, Text Encoder Alignment Loss, and Overlap Loss, each tailored to mitigate these ambiguities. Despite significant improvements, we observe that semantic ambiguity persists when generating…
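The idea of repairing a latent by test-time optimization can be illustrated with a toy sketch. This is not the paper's implementation: the two "attention maps" are hypothetical stand-ins computed by a softmax over a small latent, the overlap measure is an assumed simplification rather than the paper's Overlap Loss, and no denoiser is involved. It only shows the general pattern of taking gradient steps on a latent to reduce a loss that penalizes two subjects attending to the same positions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def overlap_loss(latent):
    """Toy overlap (an assumption, not the paper's definition):
    sum over positions of the product of two subjects' attention maps."""
    attn_a = softmax(latent[0])
    attn_b = softmax(latent[1])
    return sum(a * b for a, b in zip(attn_a, attn_b))

def numerical_grad(loss_fn, latent, eps=1e-5):
    """Central-difference gradient; a real system would use autodiff."""
    grad = [[0.0] * len(row) for row in latent]
    for i, row in enumerate(latent):
        for j in range(len(row)):
            original = row[j]
            row[j] = original + eps
            up = loss_fn(latent)
            row[j] = original - eps
            down = loss_fn(latent)
            row[j] = original  # restore the entry before moving on
            grad[i][j] = (up - down) / (2 * eps)
    return grad

def repair_latent(latent, loss_fn, steps=200, lr=1.0):
    """Hypothetical helper: plain gradient descent on a copy of the latent."""
    latent = [row[:] for row in latent]
    for _ in range(steps):
        grad = numerical_grad(loss_fn, latent)
        for i, row in enumerate(latent):
            for j in range(len(row)):
                row[j] -= lr * grad[i][j]
    return latent

# Two subjects whose toy attention initially peaks at the same positions.
initial = [[2.0, 1.5, 0.1, 0.0], [1.8, 1.6, 0.2, 0.1]]
repaired = repair_latent(initial, overlap_loss)
before = overlap_loss(initial)
after = overlap_loss(repaired)
print(f"overlap before: {before:.4f}, after: {after:.4f}")
```

In an actual diffusion pipeline the loss would presumably be computed from quantities inside the model at the early denoising steps, and the update would be applied to the noisy latent before denoising continues; those details are not given in this summary.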
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Dense Connections · Label Smoothing · Dropout · Linear Layer · Layer Normalization · Byte Pair Encoding · Adam · Residual Connection · Softmax
