PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On

Haohua Chen; Tianze Zhou; Wei Zhu; Runqi Wang; Yandong Guan; Dejia Song; Yibo Chen; Xu Tang; Yao Hu; Lu Sheng; Zhiyong Wu

arXiv:2603.11675·cs.CV·March 13, 2026

TL;DR

PROMO introduces a promptable, efficient, and high-fidelity virtual try-on framework that leverages flow-matching transformers with multi-modal conditioning, outperforming prior methods in realism and speed.

Contribution

The paper presents PROMO, a virtual try-on (VTON) framework that combines flow-matching transformers with latent multi-modal conditioning to improve both efficiency and output quality.
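To make the term "flow matching" concrete, the following is a minimal sketch of the generic linear-path flow-matching recipe, not PROMO's actual implementation. It assumes the common convention in which `x0` is a noise sample and `x1` is a data sample (a latent, in a latent-space model). The function names, the `cond` argument (a stand-in for whatever person, garment, or prompt conditioning a model receives), and the plain Euler sampler are all hypothetical choices made for illustration.

```python
def flow_matching_pair(x0, x1, t):
    """Build one training pair on the straight path from x0 (noise) to x1 (data).

    Returns the interpolated point x_t and the constant target velocity x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target


def fm_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)


def euler_sample(velocity_fn, x0, cond, num_steps=4):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with Euler steps.

    velocity_fn is a hypothetical stand-in for a trained, conditioned network.
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy usage: an "oracle" velocity function that already knows the target.
noise = [0.0, 0.0]
data = [1.0, 2.0]
x_t, v = flow_matching_pair(noise, data, t=0.25)
sample = euler_sample(lambda x, t, cond: v, noise, cond=None, num_steps=4)
```

In this sketch the number of sampler steps is the main knob behind the fidelity versus speed trade-off that the summary refers to: fewer steps mean fewer network evaluations per generated image.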

Findings

01  Outperforms prior VTON methods in visual fidelity.

02  Achieves a better balance between quality and inference speed.

03  Demonstrates generalization to broader image editing tasks.

Abstract

Virtual Try-on (VTON) has become a core capability for online retail, where realistic try-on results provide reliable fit guidance, reduce returns, and benefit both consumers and merchants. Diffusion-based VTON methods achieve photorealistic synthesis, yet often rely on intricate architectures such as auxiliary reference networks and suffer from slow sampling, making the trade-off between fidelity and efficiency a persistent challenge. We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization. Under this perspective, our training framework is generic and transfers to broader image editing tasks. Moreover, the paired data produced by VTON constitutes a rich supervisory resource for training general-purpose editors. We present PROMO, a…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Multimodal Machine Learning Applications