Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching

Yuhan Liu; Jingwen Fu; Yang Wu; Kangyi Wu; Pengna Li; Jiayi Wu; Sanping Zhou; Jingmin Xin

arXiv:2507.10318 · cs.CV · July 15, 2025

TL;DR

This paper introduces IMD, a framework that aligns vision foundation models with image feature matching by integrating generative diffusion models and a cross-image prompting mechanism, significantly improving multi-instance matching performance.

Contribution

The paper proposes a novel framework combining diffusion models and a cross-image prompting module to address misalignment in vision foundation models for feature matching.

Findings

01. IMD achieves state-of-the-art results on standard benchmarks.

02. IMD achieves a 12% improvement on the IMIM multi-instance benchmark.

03. IMD effectively mitigates the misalignment issue in feature matching.

Abstract

Leveraging vision foundation models has emerged as a mainstream paradigm for improving the performance of image feature matching. However, previous works have ignored the misalignment that arises when foundation models are introduced into feature matching. This misalignment stems from the discrepancy between foundation models, which focus on single-image understanding, and feature matching, which requires cross-image understanding. Specifically, 1) the embeddings derived from commonly used foundation models differ from the optimal embeddings required for feature matching; and 2) there is no effective mechanism for transferring single-image understanding ability to cross-image understanding. A significant consequence of this misalignment is that existing methods struggle with multi-instance feature matching problems. To address this, we introduce a simple but effective framework, called…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques