FastFit: Accelerating Multi-Reference Virtual Try-On via Cacheable Diffusion Models

Zheng Chong; Yanwei Lei; Shiyue Zhang; Zhuandi He; Zhen Wang; Xujie Zhang; Xiao Dong; Yiling Wu; Dongmei Jiang; Xiaodan Liang

arXiv:2508.20586·cs.CV·August 29, 2025

TL;DR

FastFit introduces a cacheable diffusion framework for multi-reference virtual try-on, achieving 3.5x faster inference and higher fidelity by reusing reference features across denoising steps.

Contribution

The paper proposes a novel cacheable diffusion architecture with semi-attention and class embeddings, enabling efficient multi-reference virtual try-on and introduces the DressCode-MR dataset.

Findings

01

FastFit achieves 3.5x speedup over existing methods.

02

It surpasses state-of-the-art fidelity metrics.

03

The DressCode-MR dataset supports complex multi-reference try-on research.

Tables

Table 1: Quantitative comparison of model efficiency. Best and second-best results are in bold and underlined, respectively.

Method	Params(M) $↓$	Time(s) $↓$	Memory(M) $↓$
Any2AnyTryon	16786.78	12.19	35218
PromptDresser	6011.03	4.29	17364
FitDiT	5870.80	2.00	15992
Leffa	1802.72	3.32	7996
IDM-VTON	7086.91	2.76	19072
OOTDiffusion	2229.73	1.93	10154
CatVTON	899.06	2.10	5500
FastFit	904.86	1.16	6944

Table 2: Quantitative comparison on DressCode-MR for multi-reference try-on. Best and second-best results are in bold and underlined, respectively.

Method	Time(s) $↓$	Paired				Unpaired
Method	Time(s) $↓$	FID $↓$	KID $↓$	SSIM $↑$	LPIPS $↓$	FID $↓$	KID $↓$
AnyDoor	12.08	37.138	22.571	0.768	0.235	44.068	23.958
Paint-By-Example	5.22	28.296	16.092	0.796	0.215	31.135	17.887
MimicBrush	6.62	21.074	9.858	0.800	0.173	22.111	9.992
Part2Whole	5.73	20.313	8.200	0.807	0.187	24.564	10.581
CatVTON	8.94	16.131	6.980	0.856	0.106	18.339	7.458
IP-Adapter	5.62	14.459	4.144	0.861	0.089	24.139	10.783
FitDiT	3.38	14.722	5.471	0.850	0.122	15.956	5.645
FastFit	1.90	9.311	1.512	0.859	0.079	12.059	2.123

Table 3: Quantitative comparison for single-reference virtual try-on on the VITON-HD (Choi et al., 2021) and DressCode (Morelli et al., 2022) datasets. All metrics are rounded to three decimal places. Best and second-best results in each column are in bold and underlined, respectively.

Method	VITON-HD						DressCode
Method	Paired				Unpaired		Paired				Unpaired
	FID $↓$	KID $↓$	SSIM $↑$	LPIPS $↓$	FID $↓$	KID $↓$	FID $↓$	KID $↓$	SSIM $↑$	LPIPS $↓$	FID $↓$	KID $↓$
Any2AnyTryon	11.195	2.806	0.799	0.194	9.981	3.496	5.111	1.265	0.897	0.059	6.709	1.580
PromptDresser	5.934	0.550	0.846	0.090	8.885	0.909	9.563	4.795	0.858	0.104	10.618	4.978
FitDiT	8.176	1.079	0.838	0.096	9.979	1.478	5.571	1.901	0.899	0.058	4.805	0.712
Leffa	5.667	0.692	0.857	0.076	10.446	2.640	7.193	2.114	0.861	0.084	20.099	13.506
IDM-VTON	6.112	1.112	0.866	0.074	9.249	1.267	7.181	3.524	0.891	0.070	9.167	4.489
CatVTON	6.738	1.320	0.881	0.088	10.552	2.272	3.710	1.010	0.909	0.062	5.872	1.606
OOTDiffusion	5.762	0.267	0.843	0.072	9.082	0.702	6.975	2.014	0.873	0.077	8.121	2.886
FastFit	5.629	0.505	0.885	0.078	8.629	0.665	2.836	0.390	0.907	0.057	4.397	0.553

Table 4: Ablation study of the key components in our model on the DressCode (Morelli et al., 2022) dataset. The best and second-best results are shown in bold and underlined, respectively.

Variants	Params(M) $↓$	Time (s) $↓$	Memory (M) $↓$	Paired				Unpaired
Variants	Params(M) $↓$	Time (s) $↓$	Memory (M) $↓$	FID $↓$	KID $↓$	SSIM $↑$	LPIPS $↓$	FID $↓$	KID $↓$
w/o KV Cache	904.86	1.92	6944	2.8585	0.3737	0.9057	0.0588	4.4206	0.5903
w/ Full Attention	904.86	2.17	6944	3.1847	0.5426	0.9056	0.0606	4.6221	0.6533
w/o Class Embed	904.85	1.16	6944	2.9146	0.4000	0.9056	0.0591	4.4624	0.5929
w/ ReferenceNet	1729.92	1.16	8770	2.8474	0.3577	0.9054	0.0588	4.4365	0.5741
FastFit	904.86	1.16	6944	2.8585	0.3737	0.9057	0.0588	4.4206	0.5903

Table 5: Quantitative comparison on the DressCode dataset, with results broken down by category (Upper, Lower, and Dresses). The best results are marked in bold and the second-best are underlined. ↓ indicates lower is better, while ↑ indicates higher is better.

Methods	Upper				Lower				Dresses
Methods	FID $↓$	KID $↓$	SSIM $↑$	LPIPS $↓$	FID $↓$	KID $↓$	SSIM $↑$	LPIPS $↓$	FID $↓$	KID $↓$	SSIM $↑$	LPIPS $↓$
Any2AnyTryon	10.4741	1.7130	0.9206	0.0476	13.1152	2.8336	0.8896	0.0655	9.1124	1.6539	0.8796	0.0636
PromptDresser	9.2447	0.7174	0.9044	0.0678	32.9093	17.9749	0.8327	0.1352	16.8179	6.8932	0.8363	0.1087
Leffa	11.2549	2.0947	0.8908	0.0578	19.6834	5.8985	0.8594	0.0908	13.3859	2.3701	0.8335	0.1029
IDM-VTON	11.2283	3.3860	0.9174	0.0547	11.7878	3.4218	0.8978	0.0655	17.5135	8.3389	0.8585	0.0882
OOTDiffusion	9.5945	1.1055	0.9040	0.0528	19.6615	6.1217	0.8751	0.0827	14.8496	4.4567	0.8393	0.0963
CatVTON	7.8465	1.0851	0.9360	0.0504	8.6135	1.6574	0.9236	0.0562	8.9453	1.0575	0.8669	0.0791
FitDiT	8.0876	0.5789	0.9241	0.0417	24.5079	11.7225	0.8944	0.0758	7.2253	0.4768	0.8789	0.0562
FastFit	6.8354	0.2453	0.9318	0.0485	7.1311	0.7981	0.9207	0.0511	7.5890	0.2446	0.8671	0.0720

Equations

$$c_{p} = \text{Concat}(\text{Interpolate}(M_{a}), \mathcal{E}(I_{\text{comp}}))$$

$$\{R_{i}\}_{i=1}^{K} = \{\mathcal{E}(I_{R_{i}})\}_{i=1}^{K}$$

$$\mathcal{R}_{\text{cache}}^{(i)} = \epsilon_{\theta}(R_{i}, E_{i}) \quad \text{for } i = 1, \dots, K$$

$$\tilde{\epsilon}_{t} = \epsilon_{\theta}(z_{t}, c_{p}, \gamma(t), \{\mathcal{R}_{\text{cache}}^{(i)}\}_{i=1}^{K})$$

$$I_{\text{out}} = \mathcal{D}(z_{0})$$

$$K_{\text{full}} = \text{Concat}(K_{X}, \{K_{i}^{\text{cache}}\}_{i=1}^{K}), \quad V_{\text{full}} = \text{Concat}(V_{X}, \{V_{i}^{\text{cache}}\}_{i=1}^{K})$$

$$\text{Attention}(Q_{X}, K_{\text{full}}, V_{\text{full}}) = \text{softmax}\left(\frac{Q_{X} K_{\text{full}}^{T}}{\sqrt{d_{k}}}\right) V_{\text{full}}$$

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

It achieves more efficient injection of reference information by caching the features of the reference image, while also providing a dataset of garments and accessories with significant academic value. The paper is clearly written and easily understandable, with extensive experiments conducted.

Weaknesses

The most significant issue with this paper is that its core algorithmic contribution lies in proposing a method to cache reference image information. However, such a technique is already widely used in both the research community and industrial applications. It appears more like a trick rather than a novel algorithm, which is my primary concern. At the same time, compared to Anyfit [1], this paper does introduce an approach for simultaneous multi-garment replacement. The experimental results st

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The writing is clear and the methodology is easy to understand. 2. The ablation studies thoroughly validate each component of the proposed approach—I appreciate the completeness of the ablation analysis. 3. In fact, the primary bottleneck preventing prior models from supporting multi-reference outfit composition has been the lack of suitable datasets. The most significant contribution of this paper is the introduction of the new multi-garment virtual try-on dataset, DressCode-MR. However, t

Weaknesses

1. The main concern is limited novelty. Multi-garment try-on has already been explored in prior works such as MMTryon [1] and AnyFit [2], which adopt similar strategies—e.g., spatially concatenating multiple garment references. In my view, using dedicated trainable branches for conditions versus sharing parameters with the denoising branch is primarily an engineering-level optimization rather than a fundamental academic distinction. One could interpret a single denoising branch handling both con

Reviewer 03 · Rating 4 · Confidence 5

Strengths

- Most existing VTON methods (e.g., CatVTON, Chong et al. 2024; FitDiT, Jiang et al. 2024) only support single-garment try-on, requiring sequential inference for multi-item outfits (introducing error accumulation and latency). FastFit is among the first to enable simultaneous composition of garments and accessories (including shoes and bags), a critical capability for realistic outfit visualization. Qualitative results confirm it preserves fine details (e.g., text logos on T-shirts, sheer fabric

Weaknesses

- The paper’s focus on multi-reference VTON overlaps with OmniTry (a previously published work on "try-on anything" that supports diverse garment/accessory categories). FastFit does not explicitly compare with OmniTry or demonstrate unique advantages in category coverage, generality, or composition flexibility—undermining its claim of advancing multi-reference VTON. This overlap reduces the work’s incremental contribution. - The paper fails to cite key recent VTON research, such as VTON-HandFit

Code & Models

Models

Datasets

zhengchong/DressCode-MR
dataset· 187 dl

Full text

FastFit: Accelerating Multi-Reference Virtual Try-On via Cacheable Diffusion Models

Zheng Chong$^{1,2,3}$, Yanwei Lei$^{1}$, Shiyue Zhang$^{1}$, Zhuandi He$^{1}$, Zhen Wang$^{1}$, Xujie Zhang$^{1}$, Xiao Dong$^{1}$, Yiling Wu$^{3}$, Dongmei Jiang$^{3}$ & Xiaodan Liang$^{1,3}$

$^{1}$Sun Yat-sen University $^{2}$LavieAI $^{3}$Pengcheng Laboratory

{chongzheng98,dx.icandoti,xdliang328}@gmail.com,

{leiyw5,zhangshy223,zhuandihe86,wangzh669,zhangxj59}@mail2.sysu.edu.cn,

{wuyl02,jiangdm}@pcl.ac.cn Corresponding author. Project page: https://github.com/Zheng-Chong/FastFit.

Abstract

Despite its great potential, virtual try-on technology is hindered from real-world application by two major challenges: the inability of current methods to support multi-reference outfit compositions (including garments and accessories), and their significant inefficiency caused by the redundant re-computation of reference features in each denoising step. To address these challenges, we propose FastFit, a high-speed multi-reference virtual try-on framework based on a novel cacheable diffusion architecture. By employing a Semi-Attention mechanism and substituting traditional timestep embeddings with class embeddings for reference items, our model fully decouples reference feature encoding from the denoising process with negligible parameter overhead. This allows reference features to be computed only once and losslessly reused across all steps, fundamentally breaking the efficiency bottleneck and achieving an average 3.5 $\times$ speedup over comparable methods. Furthermore, to facilitate research on complex, multi-reference virtual try-on, we introduce DressCode-MR, a new large-scale dataset. It comprises 28,179 sets of high-quality, paired images covering five key categories (tops, bottoms, dresses, shoes, and bags), constructed through a pipeline of expert models and human feedback refinement. Extensive experiments on the VITON-HD, DressCode, and our DressCode-MR datasets show that FastFit surpasses state-of-the-art methods on key fidelity metrics while retaining a significant advantage in inference efficiency.

1 Introduction

Generative AI-based virtual try-on has recently made remarkable progress. An ideal virtual try-on system, one that could revolutionize online retail and power applications like intelligent outfit visualization, would allow users to seamlessly mix and match various garments and accessories, rapidly generating photorealistic results to enable an interactive experience. However, two major challenges hinder current methods from achieving this vision. First, most existing methods (Xie et al., 2022; Wang et al., 2018; Xu et al., 2024; Choi et al., 2024; Chong et al., 2024; Jiang et al., 2024) are designed for a single reference garment (e.g., a top or a dress), so a complete multi-item outfit must be rendered through iterative passes, which inflates computation time and risks accumulating synthesis errors. Furthermore, the general lack of support for essential accessories like shoes and bags prevents the generation of truly holistic and realistic outfits. Second, the computational inefficiency of current methods stems from two competing yet flawed strategies, as illustrated in Figure 2. On one hand, ReferenceNet-based methods (Huang et al., 2024b; Choi et al., 2024; Xu et al., 2024; Zhang et al., 2024b; Zhou et al., 2024; Jiang et al., 2024) employ a separate network to encode references (Figure 2 (a)), which avoids re-encoding the references at every denoising step but at the cost of substantial parameter overhead, increasing both training and inference costs. On the other hand, in-context learning-based methods (Guo et al., 2025; Chong et al., 2024; Huang et al., 2024a) repeatedly process the concatenated reference and person features at each of the $N$ denoising steps (Figure 2 (b)), causing significant computational redundancy.

To overcome these limitations, we introduce FastFit, a high-speed framework that enables coherent multi-reference virtual try-on through a novel cacheable diffusion architecture. Our proposed Cacheable UNet decouples the reference feature encoding from the iterative denoising process, which is achieved by introducing a Reference Class Embedding and a Semi-Attention mechanism. This structure enables a Reference KV Cache during inference, which allows reference features to be computed only once and losslessly reused in all subsequent steps, fundamentally breaking the efficiency bottleneck and achieving an average 3.5 $\times$ speedup over comparable methods with negligible parameter overhead. Furthermore, observing the lack of datasets with complete outfit pairings, we construct DressCode-MR, a large-scale multi-reference try-on dataset based on Morelli et al. (2022). We developed a data-generation pipeline that trains expert models based on Chong et al. (2024) and Labs (2024) to recover canonical images of individual items, and utilizes human feedback to ensure high quality. This results in 28,179 multi-reference image sets spanning five key categories: tops, bottoms, dresses, shoes, and bags.

In summary, the contributions of this work include:

• We propose FastFit, a novel framework for high-speed, multi-reference virtual try-on. It is the first to enable coherent multi-reference virtual try-on across five key categories, including tops, bottoms, dresses, shoes, and bags, while achieving an average 3.5 $\times$ speedup over comparable methods.

• We design a novel Cacheable UNet structure featuring a Reference Class Embedding and a Semi-Attention mechanism. This design decouples reference feature encoding from the denoising process, enabling a lossless Reference KV Cache that breaks the core efficiency bottleneck of subject-driven generation architectures.

• We construct DressCode-MR, the first large-scale dataset specifically for multi-reference virtual try-on. It comprises 28,179 high-quality image sets, providing a solid foundation to foster future research in complex outfit generation.

• We conduct extensive experiments on VITON-HD, DressCode, and our DressCode-MR benchmarks, demonstrating that FastFit surpasses state-of-the-art methods in image fidelity while maintaining its significant efficiency advantage.

2 Related Work

2.1 Subject-Driven Image Generation

To enable finer-grained control in diffusion models for image generation, the research community has rapidly shifted towards subject-driven image generation. Early efforts primarily centered on single reference images, injecting specific subject identities or artistic styles by fine-tuning model weights (Ruiz et al., 2022; Yang et al., 2022; Hu et al., 2021; Huang et al., 2024a) or utilizing lightweight adapters (Ye et al., 2023; Mou et al., 2023; Chen et al., 2023). However, the former approach requires training a separate model for each subject, limiting its practical flexibility, while the latter, despite being convenient, often faces challenges in maintaining high fidelity to the reference image. Another line of work based on in-context learning, such as IC-LoRA (Huang et al., 2024a) and OminiControl (Tan et al., 2025b; a), achieves superior detail preservation by concatenating the reference image with noise along the spatial dimension. The trade-off is that the reference must participate in every denoising step, significantly increasing inference time and computational cost. The limitations of these single-reference approaches become apparent when creative needs involve composing elements from multiple, diverse sources. Consequently, some works have begun to explore multi-reference generation; for instance, IC-Custom (Li et al., 2025) inputs multiple images as a single concatenated map for multi-concept composition, Face-diffuser investigates the complex multi-person synthesis task, and MultiRef (Chen et al., 2025) provides the first systematic definition and benchmark for this task. Nevertheless, in the domain of virtual try-on, multi-reference generation remains an under-explored area. How to harmoniously compose visual information from multiple references while mitigating the heightened computational burden from increased inputs remains a significant and open challenge.

2.2 Image-based Virtual Try-On

Image-based virtual try-on aims to realistically synthesize a person wearing target garments. Classic paradigms centered on a warp-and-fuse method, which explicitly deforms the garment using either geometric transformations or learned appearance flows before the blending stage (Wang et al., 2018; Han et al., 2018; Choi et al., 2021; Han et al., 2019; Ge et al., 2021; Xie et al., 2021; 2023; Gou et al., 2023; Chong & Mo, 2022); however, these approaches are frequently hampered by visual artifacts from inaccurate warping. Subsequently, the advent of diffusion models revolutionized the field by reframing the task as end-to-end conditional image generation, bypassing the error-prone warping step. The dominant strategy in these modern models involves injecting high-fidelity garment features into the denoising process via sophisticated conditioning mechanisms, such as parallel encoder branches (i.e., ReferenceNets) or ControlNet(Zhang et al., 2023)-like structures, a technique employed by a vast body of recent work (Zhu et al., 2023; Morelli et al., 2023; Kim et al., 2023; Xu et al., 2024; Wang et al., 2024; Choi et al., 2024; Sun et al., 2024; Zhou et al., 2024; Zhang et al., 2024a; Kim et al., 2024). Recent innovations further push the boundaries by exploring alternative backbones like Diffusion Transformers (Peebles & Xie, 2022) or introducing novel control modalities such as textual prompts and more generalized conditioning schemes (Guo et al., 2025; Jiang et al., 2024). Despite achieving unprecedented realism, their inference speed and general limitation to single garments have become key bottlenecks, hindering the technology’s application in real-world scenarios that demand rapid feedback and multi-item outfit composition.

3 Methods

3.1 Overview

The overall framework of FastFit is built upon Latent Diffusion Models (LDMs) (Rombach et al., 2021) and is designed to achieve high-speed, multi-reference virtual try-on through a novel cacheable UNet architecture. The entire workflow is depicted in Figure 4 (a). To ensure the generated image preserves the person’s identity and pose while accurately rendering the new garments, we prepare two sets of conditions:

Person Conditioning $c_{p}$: To accurately preserve the person’s identity and body pose, we construct the person condition $c_{p}$. First, we utilize AutoMask (Chong et al., 2024) to generate a cloth-agnostic mask $M_{a}$ from the input image $I_{p}$. Subsequently, a composite image, $I_{\text{comp}}$, is created by combining the human pose skeleton extracted via DWPose (Yang et al., 2023) with the person image masked by $M_{a}$. $c_{p}$ is formed as:

$$c_{p} = \text{Concat}(\text{Interpolate}(M_{a}), \mathcal{E}(I_{\text{comp}}))$$

where $\mathcal{E}$ is the VAE encoder, Interpolate is a downsampling function that resizes the mask $M_{a}$ , and Concat denotes the channel-wise concatenation.
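As a rough shape check for this construction, the sketch below tracks only tensor shapes. It assumes the standard Stable Diffusion v1.5 VAE (8× spatial downsampling, 4 latent channels), which this section does not state explicitly; the function name `person_condition_shape` is made up for illustration.

```python
# Shape bookkeeping for the person condition c_p.
# Assumption (not stated in this section): the VAE is the standard
# Stable Diffusion v1.5 one, with 8x spatial downsampling and 4 latent channels.
VAE_DOWNSAMPLE = 8
LATENT_CHANNELS = 4


def person_condition_shape(height, width):
    """Return (channels, h, w) of c_p = Concat(Interpolate(M_a), E(I_comp))."""
    h, w = height // VAE_DOWNSAMPLE, width // VAE_DOWNSAMPLE
    mask_shape = (1, h, w)                  # Interpolate(M_a): mask resized to latent size
    latent_shape = (LATENT_CHANNELS, h, w)  # E(I_comp): VAE latent of the composite image
    # Channel-wise concatenation adds the channel counts and keeps h, w.
    return (mask_shape[0] + latent_shape[0], h, w)


c_p_shape = person_condition_shape(1024, 768)
print(c_p_shape)  # (5, 128, 96)
```

Under these assumptions, stacking the 5-channel $c_{p}$ with a 4-channel noisy latent $z_{t}$ gives the 9 input channels expected by a Stable Diffusion inpainting UNet.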

Reference Conditioning $\{R_{i}\}_{i=1}^{K}$ : To capture the detailed appearance of the target garments, we extract a set of reference latents $\{R_{i}\}_{i=1}^{K}$ from the corresponding reference images $\{I_{R_{i}}\}_{i=1}^{K}$ , which is defined as:

$$\{R_{i}\}_{i=1}^{K} = \{\mathcal{E}(I_{R_{i}})\}_{i=1}^{K}$$

The image generation process is guided by a denoising UNet $\epsilon_{\theta}$ , which predicts the noise $\tilde{\epsilon}_{t}$ at each timestep $t$ . As illustrated in Figure 4 (a), our key innovation is to conceptually partition the function of $\epsilon_{\theta}$ into two streams: a time-independent path for reference inputs and a time-dependent path for the denoising process. Specifically, each reference latent $R_{i}$ is processed individually by a dedicated, time-independent path within the UNet, conditioned only on its corresponding Class Embedding $E_{i}$ . This allows us to pre-compute and cache a separate feature representation, $\mathcal{R}_{\text{cache}}^{(i)}$ , for each item before the denoising loop begins. This operation is performed for all $i\in\{1,\dots,K\}$ and is independent of any timestep $t$ :

$$\mathcal{R}_{\text{cache}}^{(i)} = \epsilon_{\theta}(R_{i}, E_{i}) \quad \text{for } i = 1, \dots, K$$

The resulting set of cached features, $\{\mathcal{R}_{\text{cache}}^{(i)}\}_{i=1}^{K}$ , is then collectively used in each step of the main denoising loop.

The main denoising loop then proceeds for $N$ steps. At each step $t$ , the UNet, $\epsilon_{\theta}$ , processes only the time-dependent inputs: the noisy latent $z_{t}$ , the person condition $c_{p}$ , and the timestep embedding $\gamma(t)$ . It integrates the static reference information by attending to the pre-computed set of cached features, $\{\mathcal{R}_{\text{cache}}^{(i)}\}_{i=1}^{K}$ , via a Semi-Attention mechanism (detailed in Section 3.2):

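$$\tilde{\epsilon}_{t} = \epsilon_{\theta}(z_{t}, c_{p}, \gamma(t), \{\mathcal{R}_{\text{cache}}^{(i)}\}_{i=1}^{K})$$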

This decomposition of the denoising process is the key to FastFit’s efficiency, as it shifts the expensive computation for multiple reference features entirely out of the iterative loop. Once the process concludes at $t=0$ , the final clean latent, $z_{0}$ , is mapped back to the pixel space using the VAE decoder $\mathcal{D}$ , to produce the high-resolution output image, $I_{\text{out}}$ :

$$I_{\text{out}} = \mathcal{D}(z_{0})$$
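The two-stage control flow described in this overview can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_fastfit_style_inference`, `encode_reference`, and `denoise_step` are hypothetical names, the person condition $c_{p}$ is assumed to be folded into `denoise_step`, and the toy stand-ins only count how often each path of the UNet would run.

```python
def run_fastfit_style_inference(references, class_ids, num_steps,
                                encode_reference, denoise_step, z_init):
    """Two-stage loop: cache reference features once, then denoise.

    encode_reference(R_i, E_i) and denoise_step(z, t, cache) are
    hypothetical callables standing in for the two paths of the UNet.
    """
    # Stage 1: time-independent path, executed once per reference item.
    cache = [encode_reference(r, e) for r, e in zip(references, class_ids)]

    # Stage 2: time-dependent path, executed once per denoising step.
    z = z_init
    for t in reversed(range(num_steps)):  # t = N-1, ..., 0
        z = denoise_step(z, t, cache)
    return z, cache


# Toy stand-ins that only count how often each path runs.
calls = {"reference": 0, "denoise": 0}


def toy_encode_reference(r, e):
    calls["reference"] += 1
    return (r, e)


def toy_denoise_step(z, t, cache):
    calls["denoise"] += 1
    return z  # a real model would update z here


z_final, ref_cache = run_fastfit_style_inference(
    references=["top", "bottom", "shoes"], class_ids=[0, 1, 3],
    num_steps=20, encode_reference=toy_encode_reference,
    denoise_step=toy_denoise_step, z_init=0.0)
print(calls)  # {'reference': 3, 'denoise': 20}
```

With three reference items and 20 steps, the reference path runs 3 times in total. It does not run 3 × 20 times, as it would if the references were re-encoded at every step.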

3.2 Cacheable UNet for Efficient Conditioning

The primary bottleneck in existing subject-driven diffusion models is the repeated computation of reference features at every denoising step. This is because the reference conditioning is typically dependent on the timestep $t$ , making the features dynamic. Our key innovation, the Cacheable UNet, fundamentally breaks this dependency, enabling reference features to be computed once and reused. This is achieved through two core components: Reference Class Embedding and a Semi-Attention mechanism, as illustrated in Figure 4 (b).

Reference Class Embedding.

To decouple the reference features from the denoising timestep $t$, we replace the conventional timestep embedding with a static, learnable Reference Class Embedding for the reference items. Specifically, for a set of $K$ reference items $\{R_{1},\dots,R_{K}\}$, each belonging to a certain category (e.g., 'top', 'shoes'), we introduce a corresponding set of learnable class embeddings $\{E_{1},\dots,E_{K}\}$. The features for each reference item $R_{i}$ are conditioned on its class embedding $E_{i}$ instead of the shared timestep embedding $\gamma(t)$ used by the denoising features $X$. The reference class embedding is injected in the same manner as the timestep embedding; both modulate the features within the ResNet blocks through a scaling operation. Since the class embeddings are constant throughout the entire denoising process, the resulting reference features become static and independent of the current timestep $t$, making them inherently cacheable.
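A toy illustration of why this substitution makes reference features cacheable is given below. The exact modulation used inside the ResNet blocks is not specified here, so the `feature * (1 + embedding)` form, the embedding values, and the 4-dimensional size are all assumptions of this sketch.

```python
import math


def timestep_embedding(t, dim=4):
    """Standard sinusoidal embedding gamma(t); changes with every step."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


# Hypothetical learnable class embeddings E_i (fixed numbers here).
CLASS_EMBEDDINGS = {
    "top": [0.10, -0.20, 0.05, 0.30],
    "shoes": [-0.15, 0.25, 0.00, 0.10],
}


def modulate(features, embedding):
    """Assumed scaling form: feature * (1 + embedding), channel by channel."""
    return [f * (1.0 + e) for f, e in zip(features, embedding)]


reference_features = [1.0, 2.0, 3.0, 4.0]

# Conditioned on gamma(t): the result differs between steps, so it cannot be cached.
with_time = [modulate(reference_features, timestep_embedding(t)) for t in (10, 500)]

# Conditioned on E_i: the result is identical at every step, so it can be cached.
with_class = [modulate(reference_features, CLASS_EMBEDDINGS["top"]) for _ in (10, 500)]

print(with_time[0] != with_time[1])    # True
print(with_class[0] == with_class[1])  # True
```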

Semi-Attention Mechanism.

Having made reference features static, we need a mechanism to inject their information into the denoising process without compromising their static nature. A standard full self-attention would allow information to flow from the step-dependent denoising features $X$ back to the reference features $R_{i}$, thereby "contaminating" them and breaking the condition for caching. To solve this, we propose a Semi-Attention mechanism, visualized in Figure 5. In this design, we treat both the denoising features $X$ and all reference features $\{R_{i}\}$ as a single sequence of tokens. The attention calculation is governed by a specific mask that controls the information flow: (1) Denoising-to-All: The tokens of the denoising feature $X$ are allowed to attend to all tokens in the sequence (i.e., to itself and to all reference features $R_{i}$). This allows the model to effectively "read" the appearance information from each garment and apply it to the person. (2) Reference-to-Self: The tokens of each reference feature $R_{i}$ are only allowed to attend to themselves. They cannot attend to the denoising features $X$ or to any other reference feature $R_{j}$ (where $j\neq i$). This attention mask ensures that the reference features act as a static, read-only source of information for the denoising process. Their representations are never updated by the dynamic features of $X$, thus preserving their cacheability across all timesteps.
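The two masking rules can be written down as a boolean matrix. The sketch below is one plain-Python way to build it, assuming the token order $[X, R_{1}, \dots, R_{K}]$; the function name `build_semi_attention_mask` is illustrative and not taken from the paper's code.

```python
def build_semi_attention_mask(num_denoise_tokens, reference_token_counts):
    """Return mask[q][k] == True where query token q may attend to key token k.

    Token order is assumed to be [X, R_1, ..., R_K].
    """
    # Segment id per token: 0 for X, i for reference R_i.
    segments = [0] * num_denoise_tokens
    for i, count in enumerate(reference_token_counts, start=1):
        segments += [i] * count

    mask = []
    for q_seg in segments:
        if q_seg == 0:
            # Denoising-to-All: X attends to itself and every reference.
            row = [True] * len(segments)
        else:
            # Reference-to-Self: R_i attends only to tokens of R_i.
            row = [k_seg == q_seg for k_seg in segments]
        mask.append(row)
    return mask


# 2 denoising tokens, then references with 2 and 1 tokens.
mask = build_semi_attention_mask(2, [2, 1])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

For this toy input the printed rows are `11111`, `11111`, `00110`, `00110`, `00001`. The reference rows never contain a 1 in the columns of $X$, which is what keeps their features independent of the timestep.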

In summary, the Reference Class Embedding makes the computation of reference features static, while the Semi-Attention mechanism ensures that during interaction, the static reference features only provide information without being affected by the denoising process. This synergistic design forms the Cacheable UNet architecture, laying the foundation for an efficient, cache-based inference pipeline.

3.3 Inference Acceleration with Reference KV Cache

The design of our Cacheable UNet enables a highly efficient inference pipeline via a Reference KV Cache. As depicted in Figure 4(b), the process is split into two stages:

Pre-computation and Caching (One-time Cost).

Before the iterative denoising loop begins, we perform a single forward pass for each reference item $R_{i}$ through the UNet $\epsilon_{\theta}$ . For each Semi-Attention layer, we then compute and store its corresponding Key ( $K_{i}^{\text{cache}}$ ) and Value ( $V_{i}^{\text{cache}}$ ) matrices. This pre-computation step is performed only once per generation request.

Accelerated Denoising Loop.

For every subsequent denoising step from $t=N-1$ down to $0$, we completely bypass the computation for the reference branches. Instead, for each Semi-Attention layer, we only compute the Query ( $Q_{X}$ ), Key ( $K_{X}$ ), and Value ( $V_{X}$ ) from the current denoising features $X_{t}$. We then construct the full key and value matrices, $K_{\text{full}}$ and $V_{\text{full}}$, by concatenating these dynamic tensors with all the cached keys $\{K_{i}^{\text{cache}}\}_{i=1}^{K}$ and values $\{V_{i}^{\text{cache}}\}_{i=1}^{K}$, respectively:

$$K_{\text{full}} = \text{Concat}(K_{X}, \{K_{i}^{\text{cache}}\}_{i=1}^{K}), \quad V_{\text{full}} = \text{Concat}(V_{X}, \{V_{i}^{\text{cache}}\}_{i=1}^{K})$$

The final attention output is then calculated only for the denoising query $Q_{X}$ :

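$$\text{Attention}(Q_{X}, K_{\text{full}}, V_{\text{full}}) = \text{softmax}\left(\frac{Q_{X} K_{\text{full}}^{T}}{\sqrt{d_{k}}}\right) V_{\text{full}}$$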

This strategy effectively reduces the computational cost of attention at each step to be dependent only on the denoising features, regardless of the number or complexity of reference items. This fundamentally resolves the efficiency bottleneck, leading to a substantial reduction in inference latency, especially in the multi-reference setting central to our work.
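The cached attention step can be illustrated with a small pure-Python example. It is a sketch under simplifying assumptions: the projections are toy functions (`project_reference` and the identity projection for $X_{t}$ are invented for the demo), vectors are 2-dimensional, and the usual $\sqrt{d_{k}}$ scaling of scaled dot-product attention is used.

```python
import math


def attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V for small lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


projection_calls = {"reference": 0}


def project_reference(tokens):
    """Toy stand-in for the K/V projections of one reference item."""
    projection_calls["reference"] += 1
    keys = [[2.0 * x for x in tok] for tok in tokens]
    values = [[x + 1.0 for x in tok] for tok in tokens]
    return keys, values


references = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]]  # K = 2 items

# Stage 1: compute K_i^cache and V_i^cache once.
kv_cache = [project_reference(r) for r in references]
cached_keys = [k for keys, _ in kv_cache for k in keys]
cached_values = [v for _, values in kv_cache for v in values]

# Stage 2: every step only projects the denoising tokens X_t.
num_steps = 20
for t in reversed(range(num_steps)):
    x_t = [[0.01 * t, 1.0], [1.0, -0.01 * t]]   # toy denoising features
    q_x, k_x, v_x = x_t, x_t, x_t               # toy identity projections
    k_full = k_x + cached_keys                  # Concat(K_X, cached keys)
    v_full = v_x + cached_values                # Concat(V_X, cached values)
    out = attention(q_x, k_full, v_full)        # one output row per token of X

print(projection_calls["reference"], len(out), len(k_full))  # 2 2 5
```

The reference projections run once per item (2 calls) rather than once per item per step. The queries $Q_{X}$ still attend over all 5 keys in $K_{\text{full}}$.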

4 Experiments

4.1 Datasets

We evaluate our model on three datasets, VITON-HD (Choi et al., 2021), DressCode (Morelli et al., 2022), and our newly proposed DressCode-MR, all at 1024×768 resolution. VITON-HD (Choi et al., 2021) provides 13,679 image pairs for upper-body virtual try-on (11,647 train / 2,032 test). The DressCode dataset (Morelli et al., 2022) features 53,792 full-body pairs (48,392 train / 5,400 test) covering tops, bottoms, and dresses. To facilitate multi-reference research, we introduce DressCode-MR, built upon DressCode. As illustrated in Figure 3, it contains 28,179 samples (25,779 train / 2,400 test), each pairing a person with a complete outfit from up to five categories: tops, bottoms, dresses, shoes, and bags. We constructed this dataset by training five expert restoration models (based on CatVTON (Chong et al., 2024) and FLUX (Labs, 2024)) using VITON-HD, DressCode, and a small set of internet-sourced shoe and bag pairs. These models were used to recover the canonical images for items worn in DressCode, and the final high-quality samples were selected through human feedback.

4.2 Implementation Details

We train our single-reference try-on model, initialized from the pretrained Stable Diffusion v1.5 inpainting model (Rombach et al., 2021), on the DressCode (Morelli et al., 2022) and VITON-HD (Choi et al., 2021) datasets for 64,000 steps with a batch size of 32 and a resolution of 1024 $\times$ 768. This version is used for all single-reference quantitative evaluations. Building upon the single-reference model, we fine-tune it on our proposed DressCode-MR dataset for 16,000 steps with the same resolution and batch size. We use the AdamW (Loshchilov & Hutter, 2019) optimizer with a constant learning rate of $1\times 10^{-5}$ for both training stages. To enable classifier-free guidance, 20% of the reference images were randomly dropped during training. All experiments were conducted on 8 NVIDIA H100 GPUs.

4.3 Metrics

We evaluate our model’s performance on two fronts: image fidelity and computational efficiency.

Image Fidelity. We use two settings. In the paired setting, where ground-truth images are available, we measure similarity using the Structural Similarity Index (SSIM) (Wang et al., 2004), Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018), Fréchet Inception Distance (FID) (Seitzer, 2020), and Kernel Inception Distance (KID) (Bińkowski et al., 2021). In the unpaired setting, we assess overall realism and diversity by comparing the distribution of our generated samples to that of real images using FID and KID.

Computational Efficiency. We report the total parameters, inference latency, and peak memory usage. These metrics are benchmarked by averaging 100 runs on a single NVIDIA H100 GPU, with each run configured for 20 denoising steps and with classifier-free guidance (CFG) (Ho & Salimans, 2022) enabled.
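A latency measurement of this kind can be sketched with the standard library alone. The helper below is an illustration, not the benchmarking script used for the paper: `mean_latency`, the `warmup` parameter, and `fake_inference` are invented for the example.

```python
import time


def mean_latency(run_once, runs=100, warmup=3):
    """Average wall-clock seconds of run_once() over `runs` calls.

    `warmup` is an assumption of this sketch, not a setting from the paper.
    On a GPU, run_once must also wait for the device to finish
    (for example by synchronizing) before returning.
    """
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(runs):
        run_once()
    return (time.perf_counter() - start) / runs


counter = {"calls": 0}


def fake_inference():
    # Placeholder for one full try-on generation (20 steps, CFG enabled).
    counter["calls"] += 1


latency = mean_latency(fake_inference)
print(counter["calls"])  # 103
```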

4.4 Quantitative Comparison

Single-Reference Virtual Try-On.

We conducted a quantitative comparison against current state-of-the-art virtual try-on methods (Guo et al., 2025; Kim et al., 2024; Jiang et al., 2024; Choi et al., 2024; Chong et al., 2024; Xu et al., 2024) on VITON-HD (Choi et al., 2021) and DressCode (Morelli et al., 2022) datasets. As shown in Table 3, FastFit achieves competitive results across both datasets under paired and unpaired settings, demonstrating its superior capability in generating high-quality images. Table 1 highlights the efficiency of FastFit, which achieves an average 3.5 $\times$ speedup over comparable methods while remaining competitive in terms of parameters and memory usage.
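The 3.5 $\times$ figure is consistent with the inference times in Table 1. The text does not state how the average was computed, so the snippet below shows one plausible reading: the mean of the per-method speedups relative to FastFit.

```python
# Inference times in seconds, copied from Table 1.
baseline_times = {
    "Any2AnyTryon": 12.19,
    "PromptDresser": 4.29,
    "FitDiT": 2.00,
    "Leffa": 3.32,
    "IDM-VTON": 2.76,
    "OOTDiffusion": 1.93,
    "CatVTON": 2.10,
}
fastfit_time = 1.16

# Per-method speedup = baseline time / FastFit time.
speedups = {name: t / fastfit_time for name, t in baseline_times.items()}
average_speedup = sum(speedups.values()) / len(speedups)

print(round(average_speedup, 2))  # 3.52
```

Individual speedups range from about 1.66 $\times$ (OOTDiffusion) to about 10.5 $\times$ (Any2AnyTryon), so the average is pulled up noticeably by the slowest baseline.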

Multi-Reference Virtual Try-On.

Table 2 shows our multi-reference try-on results. In the absence of methods designed for simultaneous multi-reference generation, we adapt strong baselines from subject-driven generation (Ye et al., 2023; Yang et al., 2022; Chen et al., 2023; 2024) and multi-category try-on (Jiang et al., 2024; Chong et al., 2024; Huang et al., 2024b) via sequential single-reference inference. FastFit achieves state-of-the-art scores across quality metrics and is also the most efficient method. This demonstrates its superior ability to cohesively synthesize multiple references with high fidelity.

4.5 Qualitative Comparison

Single-Reference Virtual Try-On. Figure 7 shows the qualitative comparison for the single-reference try-on task. On the VITON-HD (Choi et al., 2021) dataset, our method excels at preserving fine-grained details, such as the text “REBEL” on T-shirts, where other methods often produce blurred results. FastFit also realistically renders challenging materials, like the sheer polka-dot top. On the DressCode (Morelli et al., 2022) dataset, our model accurately captures the correct shape and style of complex garments like the high-slit dress.

Multi-Reference Virtual Try-On. We further evaluate FastFit on the more challenging multi-reference virtual try-on task, with results presented in Figure 6. The comparison clearly demonstrates our model’s superior capability. FastFit successfully synthesizes a coherent and realistic final image by seamlessly combining multiple reference items. In contrast, most existing methods, such as AnyDoor (Chen et al., 2023) and PBE (Yang et al., 2022), often fail to properly compose the different garments or produce significant artifacts. Our method, however, maintains the identity and details of each piece of clothing, resulting in a natural and believable complete outfit.

4.6 Ablation Studies

The results in Table 4 validate our key design choices. Firstly, the Reference KV Cache is crucial for efficiency: disabling it increases inference time from 1.16s to 1.92s, so the cache yields a $\sim$ 1.66 $\times$ speedup with no loss in generation quality, as the performance metrics are identical. Secondly, our parameter-sharing strategy is highly effective. Introducing a separate ReferenceNet nearly doubles the parameters (904.86M $\rightarrow$ 1729.9M) and increases memory usage, but yields no corresponding performance improvement. Furthermore, replacing Semi-Attention with Full Attention is detrimental, as it not only slows inference to 2.17s but also degrades generation quality (e.g., FID increases to 3.1847). We hypothesize this is because full interaction disrupts the consistency of reference features. Finally, removing the Class Embedding causes a slight performance drop, and its effectiveness in guiding region-specific attention is presented in Section A.3. All ablation models follow the settings described in Section 4.2, are trained for 32K steps, and are evaluated on the DressCode (Morelli et al., 2022) dataset.
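
The following toy sketch illustrates why a timestep-independent reference encoding can be cached. It is not the authors' implementation: the real model caches per-layer attention keys and values inside the UNet, whereas here `toy_encode`, `CachedReferenceEncoder`, and the string "features" are hypothetical stand-ins that only count how often the reference branch runs.

```python
class CachedReferenceEncoder:
    """Memoize reference features keyed by (reference id, class label).

    Caching is valid only because the encoding takes no timestep input,
    mirroring the replacement of timestep embeddings with class embeddings
    for reference items.
    """

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn
        self.cache = {}
        self.calls = 0  # number of real (uncached) encoder invocations

    def __call__(self, ref_id, class_label):
        key = (ref_id, class_label)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.encode_fn(ref_id, class_label)
        return self.cache[key]


def toy_encode(ref_id, class_label):
    # Stand-in for computing reference keys/values; note: no timestep argument.
    return f"kv({ref_id}|{class_label})"


def run_denoising(references, steps, encoder):
    """Fetch reference features at every step, as a denoising loop would."""
    last = None
    for _ in range(steps):
        last = [encoder(ref_id, label) for ref_id, label in references]
        # A real model would attend over `last` here to update the latent.
    return last


references = [("shirt.jpg", "upper"), ("jeans.jpg", "lower"), ("bag.jpg", "bag")]
steps = 20
cached = CachedReferenceEncoder(toy_encode)
features = run_denoising(references, steps, cached)
uncached_calls = steps * len(references)
print(f"reference encodings with cache: {cached.calls}, without: {uncached_calls}")
```

The ratio of encoder calls in this toy is much larger than the measured $\sim$ 1.66 $\times$ wall-clock speedup, presumably because the denoising branch itself still runs at every step and only the reference branch is saved.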

5 Conclusion

In this paper, we proposed FastFit, a high-speed multi-reference virtual try-on framework designed to break the critical trade-offs between versatility, efficiency, and quality in existing technologies. Through an innovative Cacheable UNet, which combines a Class Embedding and a Semi-Attention mechanism, we decoupled reference feature encoding from the denoising process. This design enables a Reference KV Cache that allows reference features to be computed once and reused losslessly across all steps, fundamentally eliminating the computational redundancy that plagues current methods. Experimental results show that FastFit achieves a significant efficiency advantage—an average 3.5 $\times$ speedup over comparable methods—without sacrificing generation quality. For the first time, it enables coherent, synergistic try-on for up to 5 key categories: tops, bottoms, dresses, shoes, and bags. Furthermore, the DressCode-MR dataset we constructed provides a valuable foundation for future research in complex outfit generation. In summary, FastFit represents a promising advance towards a more realistic, efficient, and diverse virtual try-on experience, significantly lowering the barriers for its widespread application in e-commerce and intelligent outfit visualization.

Limitations and Future Work.

Despite the model’s strong performance, several areas present opportunities for future exploration. To further enhance realism, the modeling of complex physical interactions and layering among garments could be improved. Expanding the DressCode-MR dataset with such complex interaction pairs would be a valuable direction. Another important research path is improving generalization to underrepresented apparel, such as styles with unique topologies or challenging materials. Finally, while our framework significantly accelerates inference, a gap remains toward achieving real-time interaction. Exploring techniques such as guidance and step distillation, combined with more advanced caching mechanisms, offers a promising path to bridge this gap and enable applications like interactive real-time outfit visualization.

Appendix A Appendix

A.1 Quantitative Comparison across Garment Types

For a more fine-grained analysis, Table 5 presents a quantitative comparison on the DressCode (Morelli et al., 2022) dataset, with results broken down by clothing category. The results highlight the robust and superior performance of our method across all tested categories, including upper, lower, and dresses. FastFit consistently achieves either the best or second-best scores in the vast majority of key metrics, demonstrating its strong and stable performance regardless of the garment type. This showcases the model’s excellent generalization capability for different clothing styles.

A.2 More Visual Comparisons

A.2.1 Single-Reference Virtual Try-On

In the single-reference virtual try-on task, our method demonstrates robust performance across both the VITON-HD (Choi et al., 2021) and DressCode (Morelli et al., 2022) datasets. As illustrated in Figure 8, FastFit excels at preserving high-frequency details on the garments, such as intricate patterns and text logos. Furthermore, our model accurately renders the correct shape and length for various types of clothing. The final results show that the garments are naturally fused with the person’s body, effectively handling challenging poses and occlusions.

A.2.2 Multi-Reference Virtual Try-On

For the more challenging multi-reference task, FastFit exhibits a significant advantage over competing methods. Figure 9 showcases our model’s unique ability to seamlessly combine multiple, distinct reference items into a single, coherent outfit. Notably, even during this complex composition process, FastFit faithfully preserves the fine-grained details and logos of each individual item (e.g., “SPAS”, “CHIUS”). This capability to generate complete and detailed ensembles in complex scenarios highlights its superiority where other methods often struggle.

A.3 Visual Analysis of the Effectiveness of Class Embeddings

To visually validate the effectiveness of the Reference Class Embedding as a key control mechanism in our model, we conducted an additional ablation study. As shown in Figure 10, the experiment is designed to isolate the influence of the class embedding. For each example row, we provide the model with the exact same source person and reference image. The only variable changed across the columns is the specific class embedding provided (e.g., ‘Upper’, ‘Lower’, ‘Dresses’, ‘Shoes’, ‘Bag’). The results demonstrate that the class embedding provides fine-grained, semantic control over the try-on process. The model is able to interpret the embedding and selectively transfer the corresponding item from the reference image, even when multiple items are present. This experiment confirms that by applying a Class Embedding, the model’s attention is effectively guided to the corresponding region of the reference image, which is crucial for preventing the features of different reference items from being conflated in a multi-reference scenario.

Bibliography

Bińkowski et al. (2021) Mikołaj Bińkowski, Danica J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs, 2021. URL https://arxiv.org/abs/1801.01401.
Chen et al. (2025) Ruoxi Chen, Siyuan Wu, Dongping Chen, Shiyun Lang, Petr Sushko, Gaoyang Jiang, Sinan Wang, Yao Wan, and Ranjay Krishna. MultiRef: Controllable image generation with multiple visual references. In Synthetic Data for Computer Vision Workshop @ CVPR 2025, 2025. URL https://openreview.net/forum?id=TZwQU36KFd.
Chen et al. (2023) Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. AnyDoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481, 2023.
Chen et al. (2024) Xi Chen, Yutong Feng, Mengting Chen, Yiyang Wang, Shilong Zhang, Yu Liu, Yujun Shen, and Hengshuang Zhao. Zero-shot image editing with reference imitation, 2024. URL https://arxiv.org/abs/2406.07547.
Choi et al. (2021) Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. VITON-HD: High-resolution virtual try-on via misalignment-aware normalization. In Proc. of the IEEE conference on computer vision and pattern recognition (CVPR), 2021.
Choi et al. (2024) Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving diffusion models for virtual try-on. arXiv preprint arXiv:2403.05139, 2024.
Chong & Mo (2022) Zheng Chong and Lingfei Mo. ST-VTON: Self-supervised vision transformer for image-based virtual try-on. Signal Processing, 2022.
Chong et al. (2024) Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Hanqing Zhao, and Xiaodan Liang. CatVTON: Concatenation is all you need for virtual try-on with diffusion models, 2024. URL https://arxiv.org/abs/2407.15886.