Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation

Tianyi Wei; Dongdong Chen; Yifan Zhou; Xingang Pan

arXiv:2411.18301 · cs.CV · November 28, 2024

TL;DR

This paper improves text-to-image models based on MMDiT by addressing subject neglect and mixing issues through test-time optimization and novel loss functions, resulting in higher quality and success rates in generating similar subjects.

Contribution

We identify key ambiguities in MMDiT causing subject mixing and propose on-the-fly latent repair with tailored loss functions and strategies to enhance generation quality.

Findings

01. Significant improvement in generation quality on a challenging dataset

02. Higher success rates compared to existing methods

03. Effective mitigation of subject ambiguity issues

Abstract

Representing the cutting-edge technique of text-to-image models, the latest Multimodal Diffusion Transformer (MMDiT) largely mitigates many generation issues existing in previous models. However, we discover that it still suffers from subject neglect or mixing when the input text prompt contains multiple subjects of similar semantics or appearance. We identify three possible ambiguities within the MMDiT architecture that cause this problem: Inter-block Ambiguity, Text Encoder Ambiguity, and Semantic Ambiguity. To address these issues, we propose to repair the ambiguous latent on-the-fly by test-time optimization at early denoising steps. In detail, we design three loss functions: Block Alignment Loss, Text Encoder Alignment Loss, and Overlap Loss, each tailored to mitigate these ambiguities. Despite significant improvements, we observe that semantic ambiguity persists when generating…
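The idea of repairing a latent by test-time optimization can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not define the three losses, so the product-based overlap term below is an assumption, and the names `overlap_loss`, `numeric_grad`, and `repair_latent` are hypothetical. The "latent" here is just a list of logits that produce two subjects' attention maps over four positions, and a finite-difference gradient stands in for the automatic differentiation a real pipeline would use.

```python
import math


def softmax(logits):
    """Normalize logits into a distribution over positions."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def overlap_loss(latent, n):
    """Assumed toy overlap: sum of products of two subjects' attention maps.

    The first n entries of `latent` are logits for subject A's map,
    the remaining entries are logits for subject B's map.
    """
    a = softmax(latent[:n])
    b = softmax(latent[n:])
    return sum(x * y for x, y in zip(a, b))


def numeric_grad(f, latent, eps=1e-5):
    """Central finite differences (a stand-in for autograd)."""
    grad = []
    for i in range(len(latent)):
        up = latent[:]
        up[i] += eps
        down = latent[:]
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def repair_latent(latent, n, steps=200, lr=1.0):
    """Plain gradient descent on the latent to reduce the overlap loss."""
    latent = list(latent)
    for _ in range(steps):
        g = numeric_grad(lambda z: overlap_loss(z, n), latent)
        latent = [z - lr * gi for z, gi in zip(latent, g)]
    return latent


# Two subjects whose attention maps initially peak at the same positions.
initial = [1.0, 0.9, 0.2, 0.1, 0.9, 1.0, 0.1, 0.2]
repaired = repair_latent(initial, n=4)
loss_before = overlap_loss(initial, 4)
loss_after = overlap_loss(repaired, 4)
print(f"overlap before: {loss_before:.4f}, after: {loss_after:.4f}")
```

In the setting the abstract describes, such updates would be applied to the diffusion latent only at early denoising steps, with losses computed from the model's own attention; the toy above shows only the optimize-the-latent loop.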

Code & Models

Repositories

wtybest/enmmdit (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Dense Connections · Label Smoothing · Dropout · Linear Layer · Layer Normalization · Byte Pair Encoding · Adam · Residual Connection · Softmax