UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment

Wei Zhang; Yeying Jin; Xin Li; Yan Zhang; Xiaofeng Cong; Cong Wang; Fengcai Qiao; Zhichao Lian

arXiv:2511.15831 · cs.CV · February 11, 2026

TL;DR

UniFit introduces a universal virtual try-on framework guided by multimodal large language models, effectively bridging semantic gaps and learning complex tasks from limited data, achieving state-of-the-art results across diverse try-on scenarios.

Contribution

The paper presents UniFit, a novel VTON framework that leverages MLLM-guided semantic alignment and a two-stage training strategy to handle diverse tasks with limited data.

Findings

01 Supports a wide range of VTON tasks, including multi-garment try-on.

02 Achieves state-of-the-art performance on multiple benchmarks.

03 Effectively reduces the semantic gap between text instructions and reference images.

Abstract

Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. Recent methods explore multi-task VTON frameworks guided by textual instructions, yet they still face two key limitations: (1) semantic gap between text instructions and reference images, and (2) data scarcity in complex scenarios. To address these challenges, we propose UniFit, a universal VTON framework driven by a Multimodal Large Language Model (MLLM). Specifically, we introduce an MLLM-Guided Semantic Alignment Module (MGSA), which integrates multimodal inputs using an MLLM and a set of learnable queries. By imposing a semantic alignment loss, MGSA captures cross-modal semantic relationships and provides coherent and explicit…
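The idea of learnable queries combined with an alignment loss can be illustrated with a toy sketch. This is not the authors' MGSA implementation: the truncated abstract does not give the architecture details or the form of the loss. The sketch assumes single-head cross-attention without projections, a cosine-based loss, and a single target embedding standing in for reference-image features. All names (`cross_attend`, `alignment_loss`, `tokens`, `target`) are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attend(queries, tokens):
    """Each query attends over all tokens (scaled dot-product, single head, no projections)."""
    d = len(tokens[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, t) / math.sqrt(d) for t in tokens])
        out.append([sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(d)])
    return out

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)

def alignment_loss(query_outputs, target):
    """Assumed loss form: 1 - mean cosine similarity between query outputs and a target embedding."""
    sims = [cosine(o, target) for o in query_outputs]
    return 1.0 - sum(sims) / len(sims)

random.seed(0)
dim, n_tokens, n_queries = 8, 5, 3
# Stand-in for MLLM hidden states over the text instruction and images (assumption).
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
# In a real model these would be trainable parameters; here they are fixed random vectors.
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_queries)]
# Stand-in for a reference-image embedding used as the alignment target (assumption).
target = [random.gauss(0, 1) for _ in range(dim)]

outputs = cross_attend(queries, tokens)
loss = alignment_loss(outputs, target)
print(f"{len(outputs)} query outputs of dim {len(outputs[0])}, alignment loss = {loss:.4f}")
```

In a trainable version, the gradient of such a loss would update the queries (and any projection layers) so that the query outputs move toward the target embedding; that optimization step is omitted here.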

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Computer Graphics and Visualization Techniques