ITVTON: Virtual Try-On Diffusion Transformer Based on Integrated Image and Text
Haifeng Ni, Ming Xu

TL;DR
ITVTON introduces an efficient diffusion transformer framework for virtual try-on that combines image and text inputs, achieving high realism with reduced computational complexity and superior performance over existing methods.
Contribution
The paper presents ITVTON, a novel diffusion transformer-based virtual try-on model that simplifies architecture and enhances realism by integrating image and text features within a single generator.
Findings
Outperforms baseline methods in image quality and realism.
Reduces computational cost by restricting training to the attention parameters within a Single-DiT block.
Demonstrates robustness on a large real-world dataset.
Abstract
Virtual try-on, which aims to seamlessly fit garments onto person images, has recently seen significant progress with diffusion-based models. However, existing methods commonly resort to duplicated backbones or additional image encoders to extract garment features, which increases computational overhead and network complexity. In this paper, we propose ITVTON, an efficient framework that leverages the Diffusion Transformer (DiT) as its single generator to improve image fidelity. By concatenating garment and person images along the width dimension and incorporating textual descriptions from both, ITVTON effectively captures garment-person interactions while preserving realism. To further reduce computational cost, we restrict training to the attention parameters within a single Diffusion Transformer (Single-DiT) block. Extensive experiments demonstrate that ITVTON surpasses baseline…
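The two mechanisms the abstract names, width-wise concatenation of the garment and person images and training only attention parameters, can be illustrated with a small sketch. This is not the authors' implementation: plain nested lists stand in for image tensors, and the parameter names, the `single_blocks.` prefix, and the `.attn.` key are hypothetical naming conventions chosen for illustration. It also assumes the trainable set is the attention parameters of Single-DiT blocks, which is one reading of the abstract.

```python
# Sketch only: nested lists stand in for image tensors, and all parameter
# names below are hypothetical, not taken from the ITVTON code base.

def concat_along_width(garment, person):
    """Join two images (lists of rows) side by side; heights must match."""
    if len(garment) != len(person):
        raise ValueError("images must share the same height")
    # Each output row is the garment row followed by the person row.
    return [g_row + p_row for g_row, p_row in zip(garment, person)]


def select_trainable(param_names, block_prefix="single_blocks.", attn_key=".attn."):
    """Return names of attention parameters inside (assumed) Single-DiT blocks."""
    return [n for n in param_names if n.startswith(block_prefix) and attn_key in n]


garment = [[1, 1], [1, 1]]        # 2 x 2 toy garment image
person = [[0, 0, 0], [0, 0, 0]]   # 2 x 3 toy person image
canvas = concat_along_width(garment, person)  # 2 x 5 combined input

names = [
    "double_blocks.0.attn.to_q.weight",
    "single_blocks.0.attn.to_q.weight",
    "single_blocks.0.attn.to_k.weight",
    "single_blocks.0.mlp.fc1.weight",
]
trainable = select_trainable(names)  # everything else would stay frozen
```

In a real framework the same selection would typically drive which parameters receive gradient updates, with all other weights left frozen; the single combined canvas is what lets one generator see the garment and the person together without a second image encoder.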
Taxonomy
Topics: Image and Signal Denoising Methods · Image Retrieval and Classification Techniques
Methods: Softmax · Attention Is All You Need · Diffusion · Inpainting
