E-MD3C: Taming Masked Diffusion Transformers for Efficient Zero-Shot Object Customization

Trung X. Pham; Zhang Kang; Ji Woo Hong; Xuran Zheng; Chang D. Yoo

arXiv:2502.09164·cs.CV·February 14, 2025

TL;DR

E-MD3C introduces a lightweight, efficient masked diffusion transformer framework for zero-shot object image customization, significantly reducing computational resources while maintaining high-quality results.

Contribution

The paper presents a novel, resource-efficient masked diffusion transformer architecture with disentangled conditions and a learnable collector for improved zero-shot image customization.

Findings

01 Outperforms existing methods on the VITON-HD dataset across multiple metrics.

02 Uses only 1/4 of the parameters and 2/3 of the GPU memory of Unet-based models.

03 Achieves 2.5x faster inference with comparable or better quality.

Abstract

We propose E-MD3C ($\underline{E}$fficient $\underline{M}$asked $\underline{D}$iffusion Transformer with Disentangled $\underline{C}$onditions and $\underline{C}$ompact $\underline{C}$ollector), a highly efficient framework for zero-shot object image customization. Unlike prior works reliant on resource-intensive Unet architectures, our approach employs lightweight masked diffusion transformers operating on latent patches, offering significantly improved computational efficiency. The framework integrates three core components: (1) an efficient masked diffusion transformer for processing autoencoder latents, (2) a disentangled condition design that ensures compactness while preserving background alignment and fine details, and (3) a learnable Conditions Collector that consolidates multiple inputs into a compact representation for efficient denoising and learning. E-MD3C outperforms the…
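To make "masked diffusion transformers operating on latent patches" concrete, here is a minimal sketch of the two generic steps that phrase implies: splitting an autoencoder latent into patch tokens, then hiding a random subset of them so that only the visible tokens are processed. It is not the authors' implementation. The function names `patchify` and `random_mask`, the patch size of 2, the mask ratio of 0.25, and the toy 4x8x8 latent are all assumptions chosen for illustration; the abstract does not state any of them.

```python
import random


def patchify(latent, patch):
    """Split a (C, H, W) latent, given as nested lists, into flat patch tokens."""
    channels = len(latent)
    height, width = len(latent[0]), len(latent[0][0])
    assert height % patch == 0 and width % patch == 0, "H and W must divide by patch"
    tokens = []
    # Walk the latent in row-major order, one patch-sized window at a time.
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = [
                latent[c][top + dy][left + dx]
                for c in range(channels)
                for dy in range(patch)
                for dx in range(patch)
            ]
            tokens.append(token)
    return tokens


def random_mask(num_tokens, mask_ratio, rng):
    """Return the sorted indices of tokens that stay visible after masking."""
    num_masked = int(num_tokens * mask_ratio)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i for i in range(num_tokens) if i not in masked]


# Toy latent (assumed shape: 4 channels, 8x8 spatial), filled with distinct values.
latent = [
    [[float(c * 64 + y * 8 + x) for x in range(8)] for y in range(8)]
    for c in range(4)
]

tokens = patchify(latent, patch=2)  # 16 tokens, each of length 4 * 2 * 2
visible = random_mask(len(tokens), mask_ratio=0.25, rng=random.Random(0))
visible_tokens = [tokens[i] for i in visible]  # the subset a transformer would see
```

With these assumed settings the transformer would attend over 12 of the 16 tokens. Processing fewer tokens is one plausible source of the kind of efficiency gain the abstract describes, though the page does not say how the paper's masking is configured.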

Taxonomy

Topics: Manufacturing Process and Optimization

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Diffusion · Position-Wise Feed-Forward Layer · Adam · Softmax · Absolute Position Encodings · Dropout · Label Smoothing