TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On

Dingbao Shao; Song Wu; Shenyi Wang; Ye Wang; Ziheng Tang; Fei Liu; Jiang Lin; Xinyu Chen; Qian Wang; Ying Tai; Jian Yang; Zili Yi

arXiv:2604.27958·cs.CV·May 1, 2026

TL;DR

This paper introduces TripVVT-10K, a large-scale in-the-wild triplet dataset, and TripVVT, a Diffusion Transformer-based framework for realistic and stable video virtual try-on. Together they address two current limitations: the scarcity of triplet training data and the improper use of masks.

Contribution

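The paper contributes three things: TripVVT-10K, described as the largest and most diverse in-the-wild triplet dataset to date; TripVVT, a framework that relies on a human-mask prior instead of garment masks; and TripVVT-Bench, a 100-case benchmark for evaluating video virtual try-on methods.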

Findings

01. TripVVT outperforms existing systems in video quality and garment fidelity.

02. The dataset and benchmark facilitate progress in realistic video virtual try-on.

03. The proposed method generalizes well to challenging in-the-wild scenarios.

Abstract

Due to the scarcity of large-scale in-the-wild triplet data and the improper use of masks, the performance of video virtual try-on models remains limited. In this paper, we first introduce **TripVVT-10K**, the largest and most diverse in-the-wild triplet dataset to date, providing explicit video-level cross-garment supervision that existing video datasets lack. Built upon this resource, we develop **TripVVT**, a Diffusion Transformer-based framework that replaces fragile garment masks with a simple, stable human-mask prior, enabling reliable background preservation while remaining robust to real-world motion, occlusion, and cluttered scenes. To support comprehensive evaluation, we further establish **TripVVT-Bench**, a 100-case benchmark covering diverse garments, complex environments, and multi-person scenarios, with metrics spanning video quality, try-on fidelity, background…
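To make the idea of a human-mask prior with background preservation more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the padded bounding-box coarsening, the gray fill value, and the function names `coarse_human_mask`, `mask_out_person`, and `composite_background` are all assumptions chosen for illustration. The sketch only shows the general pattern of hiding the whole person region before generation and copying the untouched background back afterwards.

```python
import numpy as np


def coarse_human_mask(mask: np.ndarray, pad: int = 8) -> np.ndarray:
    """Expand a binary human mask of shape (H, W) into its padded bounding box.

    Assumption: a padded box stands in for whatever coarse mask the paper uses.
    """
    ys, xs = np.nonzero(mask)
    coarse = np.zeros(mask.shape, dtype=bool)
    if ys.size == 0:  # no person detected in this frame
        return coarse
    h, w = mask.shape
    y0, y1 = max(int(ys.min()) - pad, 0), min(int(ys.max()) + pad + 1, h)
    x0, x1 = max(int(xs.min()) - pad, 0), min(int(xs.max()) + pad + 1, w)
    coarse[y0:y1, x0:x1] = True
    return coarse


def mask_out_person(frames: np.ndarray, masks: np.ndarray, fill: float = 0.5) -> np.ndarray:
    """Replace the masked region of each frame with a constant fill value.

    frames: (T, H, W, 3) floats in [0, 1]; masks: (T, H, W) booleans.
    """
    return np.where(masks[..., None], fill, frames)


def composite_background(generated: np.ndarray, original: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Keep generated pixels inside the mask and original pixels outside it."""
    return np.where(masks[..., None], generated, original)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.random((2, 16, 16, 3))          # toy 2-frame video
    person = np.zeros((16, 16), dtype=bool)
    person[5:9, 6:10] = True                     # toy person silhouette
    masks = np.stack([coarse_human_mask(person, pad=2)] * 2)

    model_input = mask_out_person(frames, masks)  # what a generator would see
    generated = rng.random(frames.shape)          # placeholder for model output
    result = composite_background(generated, frames, masks)
    print(masks[0].sum(), np.array_equal(result[~masks], frames[~masks]))
```

Because the final composite copies every pixel outside the mask from the original video, the background is preserved exactly regardless of what the generator produces there. A coarse mask only has to cover the person, so it does not depend on precise garment segmentation.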
