E-MD3C: Taming Masked Diffusion Transformers for Efficient Zero-Shot Object Customization
Trung X. Pham, Zhang Kang, Ji Woo Hong, Xuran Zheng, Chang D. Yoo

TL;DR
E-MD3C introduces a lightweight, efficient masked diffusion transformer framework for zero-shot object image customization, significantly reducing computational resources while maintaining high-quality results.
Contribution
The paper presents a novel, resource-efficient masked diffusion transformer architecture with disentangled conditions and a learnable collector for improved zero-shot image customization.
Findings
Outperforms existing methods on the VITON-HD dataset across multiple metrics.
Uses only 1/4 of the parameters and 2/3 of GPU memory compared to Unet-based models.
Achieves 2.5x faster inference speed with comparable or better quality.
Abstract
We propose E-MD3C (Efficient Masked Diffusion Transformer with Disentangled Conditions and Compact Collector), a highly efficient framework for zero-shot object image customization. Unlike prior works reliant on resource-intensive Unet architectures, our approach employs lightweight masked diffusion transformers operating on latent patches, offering significantly improved computational efficiency. The framework integrates three core components: (1) an efficient masked diffusion transformer for processing autoencoder latents, (2) a disentangled condition design that ensures compactness while preserving background alignment and fine details, and (3) a learnable Conditions Collector that consolidates multiple inputs into a compact representation for efficient denoising and learning. E-MD3C outperforms the…
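To make the phrase "masked diffusion transformers operating on latent patches" more concrete, the following is a minimal sketch of the generic idea only: an autoencoder latent is cut into patch tokens, and a random subset of tokens is hidden so the transformer processes fewer of them. It is not the E-MD3C implementation. The function names `patchify` and `random_mask`, the 4 x 8 x 8 latent shape, the patch size of 2, and the 50% mask ratio are all illustrative assumptions, since the abstract does not specify them.

```python
import random


def patchify(latent, patch):
    """Split a C x H x W latent (nested lists) into flattened patch tokens."""
    channels = len(latent)
    height = len(latent[0])
    width = len(latent[0][0])
    assert height % patch == 0 and width % patch == 0
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # One token gathers a patch x patch window across all channels.
            token = [
                latent[c][top + dy][left + dx]
                for c in range(channels)
                for dy in range(patch)
                for dx in range(patch)
            ]
            tokens.append(token)
    return tokens


def random_mask(num_tokens, mask_ratio, rng):
    """Return (kept, masked) token indices for a random mask (assumed scheme)."""
    num_masked = int(num_tokens * mask_ratio)
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    kept = sorted(order[num_masked:])
    return kept, masked


# Toy latent standing in for an autoencoder output (hypothetical shape).
rng = random.Random(0)
latent = [[[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(8)]
          for _ in range(4)]

tokens = patchify(latent, patch=2)
kept, masked = random_mask(len(tokens), mask_ratio=0.5, rng=rng)

# Only the visible tokens would be passed to the transformer encoder.
visible_tokens = [tokens[i] for i in kept]
```

With these assumed settings the latent yields 16 tokens, of which 8 remain visible, so the attention cost over visible tokens drops accordingly. How E-MD3C actually chooses its mask and handles the hidden tokens is not stated in the abstract.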
Taxonomy
Topics: Manufacturing Process and Optimization
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Diffusion · Position-Wise Feed-Forward Layer · Adam · Softmax · Absolute Position Encodings · Dropout · Label Smoothing
