MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer
Nisha Huang, Henglin Liu, Yizhou Lin, Kaer Huang, Chubin Chen, Jie Guo, Tong-Yee Lee, Xiu Li

TL;DR
MaTe is a simplified diffusion framework that generates high-quality material transfer images without text guidance or extra networks, outperforming existing methods in quality and efficiency.
Contribution
MaTe introduces a unified, training-free diffusion approach that processes images at the token level with multi-modal attention, eliminating the need for complex architectures.
Findings
MaTe achieves superior visual quality compared to state-of-the-art methods.
MaTe operates efficiently without additional training or fine-tuning.
MaTe preserves detailed feature alignment in generated images.
Abstract
Recent diffusion-based methods for material transfer rely on image fine-tuning or complex architectures with assistive networks, but face challenges including text dependency, extra computational costs, and feature misalignment. To address these limitations, we propose MaTe, a streamlined diffusion framework that eliminates textual guidance and reference networks. MaTe integrates input images at the token level, enabling unified processing via multi-modal attention in a shared latent space. This design removes the need for additional adapters, ControlNet, inversion sampling, or model fine-tuning. Extensive experiments demonstrate that MaTe achieves high-quality material generation under a zero-shot, training-free paradigm. It outperforms state-of-the-art methods in both visual quality and efficiency while preserving precise detail alignment, significantly simplifying inference…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
