Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
Seungyong Lee, Jeong-gi Kwak

TL;DR
Voost is a unified diffusion transformer framework that jointly models virtual try-on and try-off tasks, improving realism and consistency in garment synthesis across pose and appearance variations.
Contribution
It introduces a scalable, joint learning approach for try-on and try-off with bidirectional supervision and novel inference techniques, without task-specific networks or extra labels.
Findings
Achieves state-of-the-art results on try-on and try-off benchmarks.
Outperforms strong baselines in alignment accuracy and visual fidelity.
Demonstrates robust generalization across diverse poses and garments.
Abstract
Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost, a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks.…
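The abstract names attention temperature scaling but this summary does not give its formulation. As a rough illustration only, the sketch below shows the generic idea of dividing attention logits by a temperature before the softmax. The function names, the `temperature` parameter, and the toy inputs are all assumptions made for illustration. How Voost actually sets the temperature (for example, as a function of resolution or mask size) is not stated here and is not reproduced.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys, temperature=1.0):
    # Generic scaled dot-product attention weights with an extra temperature.
    # ASSUMPTION: logits are divided by sqrt(d) * temperature; the paper's
    # exact scaling rule is not given in this summary.
    d = len(query)
    scale = 1.0 / (math.sqrt(d) * temperature)
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)

# Toy example (hypothetical values): one query attending over three keys.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
baseline = attention_weights(query, keys, temperature=1.0)
sharper = attention_weights(query, keys, temperature=0.5)
```

In this generic form, a temperature below 1 concentrates the weights on the best-matching key, and a temperature above 1 spreads them out. That gives one knob for compensating when the number of attended tokens changes at inference time.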
