GIM: Learning Generalizable Image Matcher From Internet Videos
Xuelun Shen, Zhipeng Cai, Wei Yin, Matthias Müller, Zijun Li, Kaixuan Wang, Xiaozhi Chen, Cheng Wang

TL;DR
GIM is a self-training framework that leverages internet videos to learn a single, generalizable image matching model capable of zero-shot cross-domain performance, significantly improving robustness and applicability.
Contribution
The paper introduces GIM, a novel self-training method that uses internet videos to train a universal image matcher, overcoming data diversity and generalization limitations of prior approaches.
Findings
GIM improves zero-shot performance by up to 18.1% on diverse benchmarks.
The method enables cross-domain generalization, including to bird's-eye-view (BEV) images.
A new zero-shot evaluation benchmark, ZEB, is proposed for assessing generalization.
Abstract
Image matching is a fundamental computer vision problem. While learning-based methods achieve state-of-the-art performance on existing benchmarks, they generalize poorly to in-the-wild images. Such methods typically need to train separate models for different scene types and are impractical when the scene type is unknown in advance. One of the underlying problems is the limited scalability of existing data construction pipelines, which limits the diversity of standard image matching datasets. To address this problem, we propose GIM, a self-training framework for learning a single generalizable model based on any image matching architecture using internet videos, an abundant and diverse data source. Given an architecture, GIM first trains it on standard domain-specific datasets and then combines it with complementary matching methods to create dense labels on nearby frames of novel…
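The abstract describes creating labels on nearby video frames, and one reviewer below refers to label propagation through video. A minimal sketch of that idea follows, assuming that matches between frames A and B and between frames B and C are already available as pixel coordinates. The function name `propagate_matches`, the tolerance parameter `tol`, and the nearest-point chaining rule are all illustrative assumptions, not the authors' implementation.

```python
from math import hypot


def propagate_matches(matches_ab, matches_bc, tol=1.0):
    """Chain A->B and B->C correspondences into A->C correspondences.

    Hypothetical helper for illustration only.
    matches_ab: list of ((xa, ya), (xb, yb)) pairs between frames A and B.
    matches_bc: list of ((xb, yb), (xc, yc)) pairs between frames B and C.
    tol: maximum pixel distance in frame B for two points to be treated
         as the same location (an assumed threshold).
    """
    propagated = []
    for point_a, point_b in matches_ab:
        best_target, best_dist = None, tol
        for source_b, point_c in matches_bc:
            dist = hypot(point_b[0] - source_b[0], point_b[1] - source_b[1])
            # Keep the closest B-frame point that lies within the tolerance.
            if dist <= best_dist:
                best_target, best_dist = point_c, dist
        if best_target is not None:
            propagated.append((point_a, best_target))
    return propagated


matches_ab = [((10, 10), (12, 11)), ((50, 40), (53, 42))]
matches_bc = [((12.4, 11.2), (15, 12)), ((90, 90), (95, 91))]
chained = propagate_matches(matches_ab, matches_bc)
print(chained)
```

In this toy input only the first A-frame point has a B-frame partner close enough to chain, so the second match is dropped rather than propagated with a poorly supported label.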
Peer Reviews
Decision: ICLR 2024 spotlight
Many current deep networks generalize poorly to unknown data distributions when the amount and diversity of training data are limited. Fine-tuning the model on a small amount of target-domain data with ground-truth (GT) supervision is a natural remedy. However, obtaining GT information for target-domain data is not always easy, especially in correspondence matching, pose estimation, 3D reconstruction, etc. To address this
My comments below are questions rather than weaknesses. (1) The proposed method combines a baseline image-matching network (e.g., SuperGlue, LoFTR, DKM) trained on a standard dataset with complementary image-matching methods to generate candidate correspondences. According to the experiments section, the complementary image-matching methods perform worse than the baseline network. I have two questions here. a. Since their performance is inferior, why are they needed? Will this increas
The overview image on page 1 is already impressive. The method works on three strong baselines (DKM, SuperGlue, and LoFTR) and improves them further. It surprises me that the method works with such large viewpoint differences and that it also works with BEV point clouds. Training on internet videos avoids the COLMAP (SfM + MVS) bias toward a single scene. The proposed GIM is essentially a point matching ground-truth reinvention by using the label propagation through video with strong
This paper is very impressive. The main thing missing is implementation detail, because the self-training section is very short and covers only the ground truth rather than the training itself. Beyond that, open-sourcing the label-propagation code and the training data would allow its accuracy to be verified.
1. Simple and scalable framework: The proposed self-training framework is simple and scalable.
2. Strong zero-shot generalizability: Compared to image matchers trained on the traditional datasets, image matchers trained using self-training demonstrated stronger zero-shot generalizability, yielding more robust and performant image matchers.
3. Comprehensive experiments: Experiments include large collections of datasets and downstream tasks, showing the superiority of self-trained image matchers
1. Lack of real indoor datasets in the benchmark: This is a nitpick, but it would be great to have more real indoor datasets in the benchmark. Right now, most of the real datasets are driving-related, and the indoor dataset covers only basements and corridors.
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
