CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation
Zheng Chong, Wenqing Zhang, Shiyue Zhang, Jun Zheng, Xiao Dong, Haoxiang Li, Yiling Wu, Dongmei Jiang, and Xiaodan Liang

TL;DR
CatV2TON is a unified diffusion transformer model that effectively handles both image and video virtual try-on tasks, ensuring high-quality, temporally consistent results across static and dynamic scenarios with efficient long-video generation strategies.
Contribution
The paper introduces CatV2TON, a diffusion transformer approach that supports both image and video try-on by temporally concatenating garment and person inputs, together with a new dataset. The combination is reported to improve robustness and efficiency.
Findings
Outperforms existing methods in image and video try-on tasks.
Achieves high temporal consistency in long videos.
Efficient inference with reduced resource usage.
Abstract
Virtual try-on (VTON) technology has gained attention due to its potential to transform online retail by enabling realistic clothing visualization of images and videos. However, most existing methods struggle to achieve high-quality results across image and video try-on tasks, especially in long video scenarios. In this work, we introduce CatV2TON, a simple and effective vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks with a single diffusion transformer model. By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance across static and dynamic settings. For efficient long-video generation, we propose an overlapping clip-based inference strategy that uses sequential frame guidance and Adaptive Clip Normalization (AdaCN) to maintain temporal consistency with…
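The overlapping clip-based inference mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `overlapping_clips` and `match_statistics`, the clip length, and the overlap size are all hypothetical. The abstract does not spell out how AdaCN is computed, so the second function assumes a simple mean and standard deviation matching against reference values, purely for illustration.

```python
from statistics import mean, pstdev


def overlapping_clips(num_frames, clip_len, overlap):
    """Split a video of num_frames frames into (start, end) index ranges.

    Consecutive clips share `overlap` frames, so the tail of one clip can
    guide the generation of the next. Hypothetical helper, not from the paper.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= overlap < clip_len:
        raise ValueError("overlap must satisfy 0 <= overlap < clip_len")
    stride = clip_len - overlap
    clips = []
    start = 0
    while True:
        end = min(start + clip_len, num_frames)
        clips.append((start, end))
        if end == num_frames:
            break
        start += stride
    return clips


def match_statistics(values, reference, eps=1e-6):
    """Shift and scale `values` so their mean and std match `reference`.

    ASSUMPTION: this stands in for Adaptive Clip Normalization (AdaCN);
    the actual formulation used by the paper may differ.
    """
    mu_v, sd_v = mean(values), pstdev(values)
    mu_r, sd_r = mean(reference), pstdev(reference)
    return [(v - mu_v) / (sd_v + eps) * sd_r + mu_r for v in values]


if __name__ == "__main__":
    # A 10-frame video, clips of 4 frames, 1 shared frame between neighbours.
    print(overlapping_clips(10, 4, 1))
    # Toy scalar "features" for a new clip, aligned to reference statistics.
    print(match_statistics([0.0, 2.0], [10.0, 14.0]))
```

In this sketch each clip after the first would be generated with its first `overlap` frames taken from the previous clip's output, and the statistic matching would be applied so that brightness and color do not drift from clip to clip. The last clip may be shorter than `clip_len` when the video length is not a multiple of the stride.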
Taxonomy
Topics: Advanced Neural Network Applications · Computer Graphics and Visualization Techniques · Advanced Image and Video Retrieval Techniques
Methods: Softmax · Attention Is All You Need · Diffusion · Contrastive Language-Image Pre-training
