DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework

Tongchun Zuo; Zaiyu Huang; Shuliang Ning; Ente Lin; Chao Liang; Zerong Zheng; Jianwen Jiang; Yuan Zhang; Mingyuan Gao; Xin Dong

arXiv:2508.02807 · cs.CV · August 6, 2025

TL;DR

DreamVVT introduces a two-stage diffusion transformer framework that leverages unpaired data and pretrained models to achieve realistic, temporally consistent video virtual try-on in unconstrained environments.

Contribution

The paper presents a stage-wise approach that combines multi-frame try-on synthesis with a pretrained video generation model to improve realism and temporal stability in video virtual try-on.

Findings

01. Outperforms existing methods in garment detail preservation.

02. Achieves superior temporal consistency in real-world videos.

03. Effectively leverages unpaired data and pretrained models.

Abstract

Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. However, most existing end-to-end methods rely heavily on scarce paired garment-centric datasets and fail to effectively leverage priors of advanced visual models and test-time inputs, making it challenging to accurately preserve fine-grained garment details and maintain temporal consistency in unconstrained scenarios. To address these challenges, we propose DreamVVT, a carefully designed two-stage framework built upon Diffusion Transformers (DiTs), which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. To further leverage prior knowledge from pretrained models and test-time inputs, in the first stage, we sample representative frames from the input video…
