JCo-MVTON: Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-on
Aowen Wang, Wei Li, Hao Luo, Mengxing Ao, Chenyu Zhu, Xinyang Li, Fan Wang

TL;DR
JCo-MVTON is a novel mask-free virtual try-on framework that integrates diffusion models with multi-modal control signals for precise garment-person synthesis, outperforming existing methods in benchmarks and real-world scenarios.
Contribution
The paper introduces a diffusion transformer-based virtual try-on system with multi-modal control and a bidirectional dataset generation strategy, enhancing realism and control.
Findings
Achieves state-of-the-art results on DressCode benchmark.
Demonstrates superior generalization to real-world images.
Provides a new dataset with high-quality, diverse synthetic images.
Abstract
Virtual try-on systems have long been hindered by heavy reliance on human body masks, limited fine-grained control over garment attributes, and poor generalization to real-world, in-the-wild scenarios. In this paper, we propose JCo-MVTON (Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-On), a novel framework that overcomes these limitations by integrating diffusion-based image generation with multi-modal conditional fusion. Built upon a Multi-Modal Diffusion Transformer (MM-DiT) backbone, our approach directly incorporates diverse control signals -- such as the reference person image and the target garment image -- into the denoising process through dedicated conditional pathways that fuse features within the self-attention layers. This fusion is further enhanced with refined positional encodings and attention masks, enabling precise spatial alignment…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
