The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
Qingdong He, Xueqin Chen, Yanjie Pan, Peng Tang, Pengcheng Xu, Zhenye Gan, Chengjie Wang, Xiaobin Hu, Jiangning Zhang, Yabiao Wang

TL;DR
The paper introduces KeyTailor, a keyframe-driven framework for improving video virtual try-on by enhancing garment details and background consistency, supported by a new high-definition dataset.
Contribution
The paper proposes a keyframe-driven details injection strategy with accompanying modules, along with a large-scale high-definition dataset, ViT-HD, to improve realism and efficiency in video virtual try-on.
Findings
KeyTailor outperforms existing methods in garment fidelity.
The dataset ViT-HD contains 15,070 high-quality videos at 810x1080 resolution.
The approach maintains background integrity without increasing model complexity.
Abstract
Although diffusion transformer (DiT)-based video virtual try-on (VVT) has made significant progress in synthesizing realistic videos, existing methods still struggle to capture fine-grained garment dynamics and preserve background integrity across video frames. They also incur high computational costs due to additional interaction modules introduced into DiTs, and the limited scale and quality of existing public datasets restrict model generalization and effective training. To address these challenges, we propose a novel framework, KeyTailor, along with a large-scale, high-definition dataset, ViT-HD. The core idea of KeyTailor is a keyframe-driven details injection strategy, motivated by the fact that keyframes inherently contain both foreground dynamics and background consistency. Specifically, KeyTailor adopts an instruction-guided keyframe sampling strategy to filter…
