UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment
Wei Zhang, Yeying Jin, Xin Li, Yan Zhang, Xiaofeng Cong, Cong Wang, Fengcai Qiao, Zhichao Lian

TL;DR
UniFit introduces a universal virtual try-on framework guided by multimodal large language models, effectively bridging semantic gaps and learning complex tasks from limited data, achieving state-of-the-art results across diverse try-on scenarios.
Contribution
The paper presents UniFit, a novel VTON framework that leverages MLLM-guided semantic alignment and a two-stage training strategy to handle diverse tasks with limited data.
Findings
Supports a wide range of VTON tasks including multi-garment try-on.
Achieves state-of-the-art performance on multiple benchmarks.
Reduces the semantic gap between text instructions and reference images.
Abstract
Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. Recent methods explore multi-task VTON frameworks guided by textual instructions, yet they still face two key limitations: (1) semantic gap between text instructions and reference images, and (2) data scarcity in complex scenarios. To address these challenges, we propose UniFit, a universal VTON framework driven by a Multimodal Large Language Model (MLLM). Specifically, we introduce an MLLM-Guided Semantic Alignment Module (MGSA), which integrates multimodal inputs using an MLLM and a set of learnable queries. By imposing a semantic alignment loss, MGSA captures cross-modal semantic relationships and provides coherent and explicit…
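The abstract is cut off before it specifies how MGSA is built, so the following is only a minimal sketch of the general idea it names: a set of query vectors attends over multimodal token features, and an alignment loss pulls the pooled result toward an image embedding. Everything here is an assumption for illustration, including the names `cross_attend`, `cosine_alignment_loss`, and `mean_pool`, the single-head attention, the cosine form of the loss, and the random stand-in features. In a real model the queries would be trained parameters and the tokens would come from the MLLM.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Each query returns a softmax-weighted average of the token features."""
    dim = len(tokens[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, t) / math.sqrt(dim) for t in tokens])
        out.append([sum(w * t[i] for w, t in zip(weights, tokens))
                    for i in range(dim)])
    return out


def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine_alignment_loss(a, b):
    """1 - cosine similarity (an assumed loss form, not the paper's)."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


random.seed(0)
dim = 8
# Hypothetical stand-ins: 4 query vectors, 10 MLLM output tokens, 1 image embedding.
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]
mllm_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(10)]
image_embedding = [random.gauss(0, 1) for _ in range(dim)]

condition = cross_attend(queries, mllm_tokens)  # one vector per query
loss = cosine_alignment_loss(mean_pool(condition), image_embedding)
```

In this sketch `condition` plays the role of the conditioning signal handed to the image generator, and `loss` is the quantity that training would minimize with respect to the queries.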
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Computer Graphics and Visualization Techniques
