CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation

Zheng Chong; Wenqing Zhang; Shiyue Zhang; Jun Zheng; Xiao Dong; Haoxiang Li; Yiling Wu; Dongmei Jiang; and Xiaodan Liang

arXiv:2501.11325 · cs.CV · January 22, 2025

TL;DR

CatV2TON is a unified diffusion transformer model that effectively handles both image and video virtual try-on tasks, ensuring high-quality, temporally consistent results across static and dynamic scenarios with efficient long-video generation strategies.

Contribution

The paper introduces CatV2TON, a diffusion transformer approach that handles both image and video try-on by temporally concatenating garment and person inputs. It also presents a new dataset, with the aim of improving robustness and efficiency.

Findings

01. Outperforms existing methods in image and video try-on tasks.

02. Achieves high temporal consistency in long videos.

03. Efficient inference with reduced resource usage.

Abstract

Virtual try-on (VTON) technology has gained attention due to its potential to transform online retail by enabling realistic clothing visualization of images and videos. However, most existing methods struggle to achieve high-quality results across image and video try-on tasks, especially in long video scenarios. In this work, we introduce CatV2TON, a simple and effective vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks with a single diffusion transformer model. By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance across static and dynamic settings. For efficient long-video generation, we propose an overlapping clip-based inference strategy that uses sequential frame guidance and Adaptive Clip Normalization (AdaCN) to maintain temporal consistency with…
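The overlapping clip-based inference mentioned in the abstract can be pictured as splitting a long video into windows that share a few frames with their predecessor. The sketch below only illustrates that windowing step under stated assumptions: the function name `overlapping_clips` and the parameters `clip_len` and `overlap` are hypothetical, the values used are arbitrary, and nothing here reproduces the paper's sequential frame guidance or AdaCN.

```python
from typing import List, Tuple


def overlapping_clips(num_frames: int, clip_len: int, overlap: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame ranges that cover a video.

    Consecutive ranges share `overlap` frames, so each clip after the
    first can be conditioned on frames already generated by the previous
    clip. All names and defaults are illustrative assumptions.
    """
    if clip_len <= 0 or not 0 <= overlap < clip_len:
        raise ValueError("need clip_len > 0 and 0 <= overlap < clip_len")
    if num_frames <= 0:
        return []

    stride = clip_len - overlap  # new frames contributed by each clip
    clips: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + clip_len, num_frames)  # last clip may be shorter
        clips.append((start, end))
        if end == num_frames:
            break
        start += stride
    return clips


if __name__ == "__main__":
    # A 10-frame video, 4-frame clips, 1 shared frame between neighbours.
    print(overlapping_clips(10, 4, 1))
```

In a pipeline of this shape, the shared frames at the start of each window are the natural place to carry information over from the previous clip; how CatV2TON actually does this is described in the full paper rather than in the truncated abstract above.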

Code & Models

Repositories

zheng-chong/catv2ton (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Computer Graphics and Visualization Techniques · Advanced Image and Video Retrieval Techniques

Methods: Softmax · Attention Is All You Need · Diffusion · Contrastive Language-Image Pre-training