OmniDiT: Extending Diffusion Transformer to Omni-VTON Framework

Weixuan Zeng; Pengcheng Wei; Huaiqing Wang; Boheng Zhang; Jia Sun; Dewen Fan; Lin HE; Long Chen; Qianqian Gan; Fan Yang; Tingting Gao

arXiv:2603.19643 · cs.CV · March 25, 2026

Open Access

TL;DR

OmniDiT is a unified diffusion transformer framework for virtual try-on and try-off that targets better detail preservation, scene generalization, and inference efficiency, supported by a large, diverse dataset (Omni-TryOn).

Contribution

The paper introduces OmniDiT, a diffusion transformer that unifies try-on and try-off tasks in a single model. It also contributes a new dataset (Omni-TryOn), an adaptive position encoding for incorporating multiple reference conditions, and shifted window attention.

Findings

01. Achieves state-of-the-art results in model-free VTON and VTOFF tasks.

02. Performs comparably to state-of-the-art methods in model-based VTON.

03. Demonstrates effective handling of complex scenes and detailed garment fitting.

Abstract

Despite the rapid advancement of Virtual Try-On (VTON) and Try-Off (VTOFF) technologies, existing VTON methods still struggle with fine-grained detail preservation, generalization to complex scenes, pipeline complexity, and inference efficiency. To tackle these problems, we propose OmniDiT, an omni Virtual Try-On framework based on the Diffusion Transformer, which combines try-on and try-off tasks into one unified model. Specifically, we first establish a self-evolving data curation pipeline to continuously produce data, and construct a large VTON dataset, Omni-TryOn, which contains over 380k diverse and high-quality garment-model-tryon image pairs and detailed text prompts. Then, we employ token concatenation and design an adaptive position encoding to effectively incorporate multiple reference conditions. To relieve the bottleneck of long sequence computation, we are the first to…
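
The token concatenation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `concat_conditions` and the `(source_id, local_index)` position ids are hypothetical stand-ins chosen for illustration, and the paper's adaptive position encoding is not reproduced here. The sketch only shows the general idea of joining target tokens and several reference-condition token sequences into one sequence while keeping track of where each token came from.

```python
from typing import List, Tuple

Token = List[float]  # one token embedding (toy representation)


def concat_conditions(
    target: List[Token],
    references: List[List[Token]],
) -> Tuple[List[Token], List[Tuple[int, int]]]:
    """Concatenate target tokens with reference-condition tokens.

    Returns the joint sequence and hypothetical position ids of the form
    (source_id, local_index), where source_id 0 is the target and
    1..N are the reference conditions in the order given.
    """
    sequence: List[Token] = []
    position_ids: List[Tuple[int, int]] = []
    for source_id, tokens in enumerate([target] + references):
        for local_index, token in enumerate(tokens):
            sequence.append(token)
            position_ids.append((source_id, local_index))
    return sequence, position_ids


# Toy example: 3 target tokens, a 2-token garment reference,
# and a 2-token person reference (all values are made up).
target = [[0.0, 0.0] for _ in range(3)]
garment = [[1.0, 1.0] for _ in range(2)]
person = [[2.0, 2.0] for _ in range(2)]
sequence, position_ids = concat_conditions(target, [garment, person])
```

In a transformer, the joint sequence would be processed by self-attention so that target tokens can attend to every reference; the source-aware position ids are one simple way, assumed here, to let the model tell the conditions apart.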

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis