MIMIC: Masked Image Modeling with Image Correspondences

Kalyani Marathe; Mahtab Bigverdi; Nishat Khan; Tuhin Kundu; Patrick Howe; Sharan Ranjit S; Anand Bhattad; Aniruddha Kembhavi; Linda G. Shapiro; Ranjay Krishna

arXiv:2306.15128 · cs.CV · May 17, 2024

TL;DR

This paper introduces MIMIC, a scalable, annotation-free pretraining dataset curated from real-world videos and simulated environments. Models pretrained on it perform better on dense geometric and object understanding tasks without relying on costly annotations.

Contribution

Proposes a novel dataset curation method for multi-view image data that does not require annotations, enabling large-scale pretraining for dense vision tasks.

Findings

1. Models trained on MIMIC-3M outperform those trained on ImageNet-1K and MULTIVIEW-HABITAT on geometric tasks.

2. MIMIC-3M improves performance on dense tasks like depth, normals, and segmentation.

3. Larger datasets like MIMIC-3M lead to better downstream task performance.

Abstract

Dense pixel-specific representation learning at scale has been bottlenecked due to the unavailability of large-scale multi-view datasets. Current methods for building effective pretraining datasets heavily rely on annotated 3D meshes, point clouds, and camera parameters from simulated environments, preventing them from building datasets from real-world data sources where such metadata is lacking. We propose a pretraining dataset-curation approach that does not require any additional annotations. Our method allows us to generate multi-view datasets from both real-world videos and simulated environments at scale. Specifically, we experiment with two scales: MIMIC-1M with 1.3M and MIMIC-3M with 3.1M multi-view image pairs. We train multiple models with different masked image modeling objectives to showcase the following findings: Representations trained on our automatically generated…
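The review excerpts on this page mention that image pairs are mined with SIFT matching and RANSAC, and that overlap between two views is measured on 16-pixel patches. The exact criterion is not given here, so the following is only a rough sketch of one way such a patch-level overlap score could be computed. It assumes the two views are related by a 3x3 homography `H` (for example, one estimated from SIFT matches with RANSAC); the function name `patch_overlap_fraction` and the "patch center lands inside the other image" rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def patch_overlap_fraction(H, height, width, patch=16):
    """Hypothetical overlap score: fraction of patch centers in view A
    that land inside view B after being mapped by the homography H.

    H is assumed to map pixel coordinates (x, y, 1) of view A into view B,
    and both views are assumed to share the same height and width.
    """
    # Centers of the non-overlapping patch grid in view A.
    ys = np.arange(patch // 2, height - patch // 2 + 1, patch)
    xs = np.arange(patch // 2, width - patch // 2 + 1, patch)
    cx, cy = np.meshgrid(xs, ys)

    # Homogeneous coordinates, shape (3, N).
    pts = np.stack([cx.ravel(), cy.ravel(), np.ones(cx.size)]).astype(float)
    mapped = np.asarray(H, dtype=float) @ pts

    # Discard points that map behind the image plane or to infinity.
    w = mapped[2]
    in_front = w > 1e-9
    safe_w = np.where(in_front, w, 1.0)
    x = mapped[0] / safe_w
    y = mapped[1] / safe_w

    inside = in_front & (x >= 0) & (x < width) & (y >= 0) & (y < height)
    return float(inside.mean())


if __name__ == "__main__":
    identity = np.eye(3)
    half_shift = np.array([[1.0, 0.0, 112.0],
                           [0.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0]])
    print(patch_overlap_fraction(identity, 224, 224))
    print(patch_overlap_fraction(half_shift, 224, 224))
```

A curation pipeline built on a score like this would keep a pair only when the fraction falls in a chosen range, so that the two views share content without being near-duplicates. Any such thresholds would have to come from the paper itself; none are stated on this page.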

Peer Reviews

Decision · ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

[Writing quality] The paper is well-written and easy to follow. Figure 1 illustrates the overview of the data curation process used to build MIMIC in detail. The authors also provided implementation details including hyper-parameters used for experiments. [Soundness of the method] The proposed curation approach seems reasonable. As the classical matching algorithm using SIFT features does not require training, it is generalizable and suitable for matching data from multiple data sources. The ef

Weaknesses

[Overlap measurement] The authors used a patch size of 16, consistent with the size used for ViT, to determine the overlap between two images. This raises the question of whether a mismatch between the patch size used for overlap measurement and the one used for masked image pre-training could present challenges. For instance, is it feasible to pre-train a model with a patch size of 32 on MIMIC (while the image pairs are computed using a patch size of 16)? How much the performance would d

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 4

Strengths

- Being able to mine large data for pairs is a relevant task in representation learning.
- They outperform multiview habitat on their evaluations.
- The method, being based on classical techniques like SIFT and RANSAC, should scale well.
- The method is simple but effective.

Weaknesses

- (minor) some exposition on what the
- This paper might be better suited for a computer vision venue.
- The paper targets dense vision tasks, but it would be interesting to see the method used to generate pairs for contrastive learning, as well as evaluations on non-dense tasks such as ImageNet finetuning/linear probe accuracy.

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

1. The paper proposes a way to select image pairs from existing datasets that could potentially train better models for dense prediction tasks.
2. The paper is in general easy to read. The paper contains details on the implementation.
3. The paper shows promising results on multiple benchmarks such as NYU-v2 and Taskonomy surface normal.

Weaknesses

1. For depth estimation, the model is only tested on NYU-v2, which is also a dataset containing mostly indoor scenes. So I feel that the current experiments are not convincing enough to support the claim that pre-training on the proposed dataset is better for depth estimation in-the-wild. How about testing on datasets that contain more general images, such as KITTI, TUM RGBD, Sintel?
2. Since the paper claims that pre-training on the constructed MIMIC-3M dataset is better for dense prediction,

Reviewer 04 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

+ The authors visit several datasets to construct the real part. This is an effort I appreciate.

Weaknesses

- The evaluation part lacks empirical significance. I think the most interesting thing about CroCo is to train useful representations for geometric tasks. Only experiments on NYU depth and Taskonomy subset normal are conducted. More geometric tasks are suggested, including single view pose regression, two-view correspondence, other geometric understanding tasks on Taskonomy like occlusion edge, single-view reconstruction for objects and scenes.
- Significance on non-geometric tasks like ADE and

Code & Models

Repositories

raivnlab/mimic · PyTorch · Official

Taxonomy

Topics: Advanced Vision and Imaging · 3D Surveying and Cultural Heritage · Cell Image Analysis Techniques