GIM: Learning Generalizable Image Matcher From Internet Videos

Xuelun Shen; Zhipeng Cai; Wei Yin; Matthias Müller; Zijun Li; Kaixuan Wang; Xiaozhi Chen; Cheng Wang

arXiv:2402.11095·cs.CV·February 20, 2024·2 cites

TL;DR

GIM is a self-training framework that leverages internet videos to learn a single, generalizable image matching model capable of zero-shot cross-domain performance, significantly improving robustness and applicability.

Contribution

The paper introduces GIM, a novel self-training method that uses internet videos to train a universal image matcher, overcoming data diversity and generalization limitations of prior approaches.

Findings

01. GIM improves zero-shot performance by up to 18.1% on diverse benchmarks.

02. The method enables cross-domain generalization, including to bird's-eye-view (BEV) images.

03. A new zero-shot evaluation benchmark, ZEB, is proposed for assessing generalization.

Abstract

Image matching is a fundamental computer vision problem. While learning-based methods achieve state-of-the-art performance on existing benchmarks, they generalize poorly to in-the-wild images. Such methods typically need to train separate models for different scene types and are impractical when the scene type is unknown in advance. One of the underlying problems is the limited scalability of existing data construction pipelines, which limits the diversity of standard image matching datasets. To address this problem, we propose GIM, a self-training framework for learning a single generalizable model based on any image matching architecture using internet videos, an abundant and diverse data source. Given an architecture, GIM first trains it on standard domain-specific datasets and then combines it with complementary matching methods to create dense labels on nearby frames of novel…
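The labeling step described in the abstract, together with the "label propagation through video" that Reviewer 02 mentions, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `merge_candidates` and `propagate`, the one-pixel thresholds, and the nearest-point chaining rule are all assumptions chosen for illustration. Matches are plain pairs of pixel coordinates.

```python
from math import hypot


def merge_candidates(match_sets, min_dist=1.0):
    """Pool matches from several matchers (hypothetical helper).

    Each match is ((x_a, y_a), (x_b, y_b)). A match is dropped when its
    source point lies within `min_dist` pixels of one already kept, so
    earlier match sets take priority over later ones.
    """
    merged = []
    for matches in match_sets:
        for pa, pb in matches:
            if all(hypot(pa[0] - qa[0], pa[1] - qa[1]) >= min_dist
                   for qa, _ in merged):
                merged.append((pa, pb))
    return merged


def propagate(matches_ab, matches_bc, tol=1.0):
    """Chain A->B and B->C matches into A->C matches (hypothetical helper).

    A point in frame B reached from frame A is linked to the closest
    source point of the B->C matches, provided it is within `tol` pixels.
    Matches with no such neighbour are discarded.
    """
    chained = []
    for pa, pb in matches_ab:
        best, best_d = None, tol
        for qb, qc in matches_bc:
            d = hypot(pb[0] - qb[0], pb[1] - qb[1])
            if d <= best_d:
                best, best_d = qc, d
        if best is not None:
            chained.append((pa, best))
    return chained


# Toy data: matches from a base matcher and one complementary matcher
# between nearby frames A and B, then matches between frames B and C.
base = [((10.0, 10.0), (12.0, 11.0)), ((50.0, 40.0), (53.0, 41.0))]
extra = [((10.2, 10.1), (12.1, 11.0)), ((80.0, 20.0), (84.0, 22.0))]
matches_bc = [((12.3, 11.2), (15.0, 12.0)), ((84.0, 22.0), (88.0, 24.0))]

labels_ab = merge_candidates([base, extra])
labels_ac = propagate(labels_ab, matches_bc)
```

In this toy example the near-duplicate match from the complementary matcher is dropped, and only the two A-to-B matches that land within one pixel of a B-to-C match survive the chaining, giving labels for the wider-baseline pair A and C. A real pipeline would also need outlier filtering, which this sketch omits.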

Peer Reviews

Decision: ICLR 2024 spotlight

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 3

Strengths

Many current deep networks generalize poorly to unknown data distributions when the amount and diversity of training data are limited. Fine-tuning the model on a small amount of target-domain data with ground-truth (GT) supervision is a natural remedy. However, obtaining GT information for target-domain data is not always easy, especially in correspondence matching, pose estimation, 3D reconstruction, etc. To address this

Weaknesses

My comments below are questions rather than weaknesses. (1) The proposed method combines a baseline image-matching network (e.g., SuperGlue, LoFTR, DKM) trained on a standard dataset with complementary image-matching methods to generate candidate correspondences. According to the experiments section, the complementary image-matching methods perform worse than the baseline network. I have two questions here. a. Since their performance is inferior, why are they needed? Will this increas

Reviewer 02 · Rating 10 (strong accept, should be highlighted at the conference) · Confidence 5

Strengths

The overview image on page 1 is already impressive. The method works on the three strongest baselines (DKM, SuperGlue, and LoFTR) and improves them further. It surprises me that the method works with such large viewpoint differences, and that it also works with BEV point clouds. Training on internet videos avoids the COLMAP (SfM + MVS) bias toward a single scene. The proposed GIM is essentially a point matching ground-truth reinvention by using the label propagation through video with strong

Weaknesses

This paper is very impressive. The only thing missing is some implementation detail, because the self-training part is very short and covers only the ground truth rather than the training itself. Beyond that, the only thing left is to open-source the label propagation code and the training data so that its accuracy can be verified.

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 2

Strengths

1. Simple and scalable framework: The proposed self-training framework is simple and scalable. 2. Strong zero-shot generalizability: Compared to image matchers trained on the traditional datasets, image matchers trained using self-training demonstrated stronger zero-shot generalizability, yielding more robust and performant image matchers. 3. Comprehensive experiments: Experiments include large collections of datasets and downstream tasks, showing the superiority of self-trained image matchers

Weaknesses

1. Lacking real indoor datasets in the benchmark: This is a nitpick but it would be great to have more real indoor datasets in the benchmark. Right now, most of the real datasets are driving-related and the indoor dataset only covers basements and corridors.

Code & Models

Repositories

xuelunshen/gim (PyTorch, official)

Models

xuelunshen/gim (Hugging Face model)

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization