SIFT-VTON: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On

Kosuke Takemoto; Takafumi Koshinaka

arXiv:2605.01296 · cs.CV · May 5, 2026

TL;DR

SIFT-VTON supervises the cross-attention layers of a diffusion-based virtual try-on model with SIFT keypoint correspondences, improving spatial alignment and the preservation of fine garment details over methods that learn these correspondences implicitly.

Contribution

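The paper integrates SIFT keypoint matching into diffusion-based virtual try-on as explicit geometric guidance: filtered matches between garment and person images are converted into spatial probability distributions that supervise the cross-attention layers during training, improving alignment and detail preservation.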

Findings

01. Significant improvements on unpaired metrics on the VITON-HD dataset

02. Better preservation of text and pattern details in qualitative results

03. Attention visualizations confirm that attention concentrates on the relevant garment regions

Abstract

Diffusion-based virtual try-on methods achieve photorealistic synthesis through cross-attention mechanisms that transfer garment features to target body regions. However, these approaches rely on implicit learning of spatial correspondences, struggling to preserve fine details such as text and illustrations. We propose a novel approach, which we call SIFT-VTON, that utilizes SIFT keypoint matching to provide explicit geometric guidance for diffusion-based virtual try-on. Our method applies domain-specific filtering to SIFT keypoint matches between garment and person images, then converts these correspondences into spatial probability distributions that supervise cross-attention layers during training. This explicit supervision guides the model to learn precise spatial alignment, concentrating attention on geometrically consistent garment regions. Experiments on the VITON-HD dataset…
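
The step of turning keypoint correspondences into spatial probability distributions that supervise cross-attention can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the Gaussian shape of the target, the `sigma` parameter, the cross-entropy loss, and the function names `gaussian_targets` and `attention_supervision_loss` do not come from the abstract. The sketch also assumes the keypoint matches have already been filtered and rescaled to the attention token grid.

```python
import numpy as np


def gaussian_targets(garment_xy, grid_hw, sigma=1.0):
    """One target distribution over garment tokens per matched keypoint.

    garment_xy: (K, 2) array of (x, y) positions on the garment token grid.
    Returns a (K, H*W) array whose rows each sum to 1.
    The Gaussian form is an illustrative assumption.
    """
    h, w = grid_hw
    ys, xs = np.mgrid[0:h, 0:w]
    # Squared distance from every grid cell to every keypoint: (K, H, W).
    d2 = (xs[None] - garment_xy[:, 0, None, None]) ** 2 \
        + (ys[None] - garment_xy[:, 1, None, None]) ** 2
    logits = (-d2 / (2.0 * sigma ** 2)).reshape(len(garment_xy), -1)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def attention_supervision_loss(attn, person_xy, garment_xy,
                               person_hw, garment_hw, sigma=1.0, eps=1e-8):
    """Cross-entropy between targets and attention rows at matched queries.

    attn: (Hp*Wp, Hg*Wg) cross-attention map, each row a distribution
          over garment tokens for one person-image query token.
    person_xy: (K, 2) array of (x, y) positions on the person token grid.
    The choice of cross-entropy is an illustrative assumption.
    """
    _, wp = person_hw
    # Row-major index of each matched person-side query token.
    q = person_xy[:, 1] * wp + person_xy[:, 0]
    targets = gaussian_targets(garment_xy, garment_hw, sigma)
    rows = attn[q]  # (K, Hg*Wg)
    return float(-(targets * np.log(rows + eps)).sum(axis=1).mean())


if __name__ == "__main__":
    person_hw = garment_hw = (4, 4)
    person_xy = np.array([[1, 2], [3, 0]])   # hypothetical matched keypoints
    garment_xy = np.array([[0, 0], [2, 3]])
    uniform = np.full((16, 16), 1.0 / 16)    # attention with no spatial focus
    loss = attention_supervision_loss(
        uniform, person_xy, garment_xy, person_hw, garment_hw)
    print(f"loss for uniform attention: {loss:.4f}")
```

Under these assumptions the loss only touches query positions that have a matched keypoint, so unmatched regions of the attention map remain unconstrained. The loss decreases as the attention rows at matched queries concentrate around the corresponding garment locations.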

Code & Models

Repositories

takesukeDS/SIFT-VTON (GitHub)
