MIMIC: Masked Image Modeling with Image Correspondences
Kalyani Marathe, Mahtab Bigverdi, Nishat Khan, Tuhin Kundu, Patrick Howe, Sharan Ranjit S, Anand Bhattad, Aniruddha Kembhavi, Linda G. Shapiro, Ranjay Krishna

TL;DR
This paper introduces MIMIC, a scalable, annotation-free pretraining dataset built from real-world videos and simulated environments. Pretraining on it improves performance on dense geometric and object understanding tasks without relying on costly annotations.
Contribution
Proposes a novel dataset curation method for multi-view image data that does not require annotations, enabling large-scale pretraining for dense vision tasks.
Findings
Models trained on MIMIC-3M outperform those trained on ImageNet-1K and MULTIVIEW-HABITAT on geometric tasks.
MIMIC-3M improves performance on dense tasks like depth, normals, and segmentation.
Larger datasets like MIMIC-3M lead to better downstream task performance.
Abstract
Dense pixel-specific representation learning at scale has been bottlenecked due to the unavailability of large-scale multi-view datasets. Current methods for building effective pretraining datasets heavily rely on annotated 3D meshes, point clouds, and camera parameters from simulated environments, preventing them from building datasets from real-world data sources where such metadata is lacking. We propose a pretraining dataset-curation approach that does not require any additional annotations. Our method allows us to generate multi-view datasets from both real-world videos and simulated environments at scale. Specifically, we experiment with two scales: MIMIC-1M with 1.3M and MIMIC-3M with 3.1M multi-view image pairs. We train multiple models with different masked image modeling objectives to showcase the following findings: Representations trained on our automatically generated…
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
[Writing quality] The paper is well-written and easy to follow. Figure 1 illustrates the overview of the data curation process used to build MIMIC in detail. The authors also provided implementation details including hyper-parameters used for experiments. [Soundness of the method] The proposed curation approach seems reasonable. As the classical matching algorithm using SIFT features does not require training, it is generalizable and suitable for matching data from multiple data sources. The ef
[Overlap measurement] The authors used a patch size of 16, consistent with the size used for ViT, to determine the overlap between two images. This raises the question of whether a mismatch between the patch size used for overlap measurement and the one used for masked image pre-training could present challenges. For instance, is it feasible to pre-train a model with a patch size of 32 on MIMIC (while the image pairs are computed using a patch size of 16)? How much the performance would d
- Being able to mine large data for pairs is a relevant task in representation learning.
- They outperform multiview habitat on their evaluations.
- The method, being based on classical techniques like SIFT and RANSAC, should scale well.
- The method is simple but effective.
- (minor) some exposition on what the
- This paper might be better suited for a computer vision venue.
- The paper targets dense vision tasks, but it would be interesting to see the method used to generate pairs for contrastive learning, as well as evaluations on non-dense tasks such as ImageNet fine-tuning and linear-probe accuracy.
1. The paper proposes a way to select image pairs from existing datasets that could potentially train better models for dense prediction tasks.
2. The paper is in general easy to read. The paper contains details on the implementation.
3. The paper shows promising results on multiple benchmarks such as NYU-v2 and Taskonomy surface normal.
1. For depth estimation, the model is only tested on NYU-v2, which is also a dataset containing mostly indoor scenes. So I feel that the current experiments are not convincing enough to support the claim that pre-training on the proposed dataset is better for depth estimation in-the-wild. How about testing on datasets that contain more general images, such as KITTI, TUM RGBD, Sintel?
2. Since the paper claims that pre-training on the constructed MIMIC-3M dataset is better for dense prediction,
+ The authors visit several datasets to construct the real part. This is an effort I appreciate.
- The evaluation part lacks empirical significance. I think the most interesting thing about CroCo is to train useful representations for geometric tasks. Only experiments on NYU depth and Taskonomy subset normal are conducted. More geometric tasks are suggested, including single view pose regression, two-view correspondence, other geometric understanding tasks on Taskonomy like occlusion edge, single-view reconstruction for objects and scenes.
- Significance on non-geometric tasks like ADE and
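The overlap-measurement question raised in the reviews above (pairs selected with a patch size of 16, pre-training possibly done with a patch size of 32) can be illustrated with a small sketch. It assumes that overlap is the fraction of patches in the first image whose centers land inside the second image after being mapped through a homography; that definition, the function names `project` and `patch_overlap`, and the 224x224 image size are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical input: a 3x3 row-major homography mapping image-1 pixels
# to image-2 pixels (e.g. as estimated from SIFT matches with RANSAC).
Homography = List[List[float]]


def project(h: Homography, x: float, y: float) -> Tuple[float, float]:
    """Map the point (x, y) through the homography h."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if abs(w) < 1e-12:
        # Point maps to infinity; treat it as outside the image.
        return float("inf"), float("inf")
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return u, v


def patch_overlap(h: Homography, width: int, height: int, patch: int = 16) -> float:
    """Fraction of patch centers in image 1 that fall inside image 2.

    Assumed overlap definition for illustration only.
    """
    cols, rows = width // patch, height // patch
    if cols == 0 or rows == 0:
        raise ValueError("patch size is larger than the image")
    inside = 0
    for r in range(rows):
        for c in range(cols):
            # Center of the patch at grid position (r, c).
            cx = c * patch + patch / 2
            cy = r * patch + patch / 2
            u, v = project(h, cx, cy)
            if 0 <= u < width and 0 <= v < height:
                inside += 1
    return inside / (rows * cols)


# A pure horizontal shift of half the image width.
shift = [[1.0, 0.0, 112.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
overlap_16 = patch_overlap(shift, 224, 224, patch=16)  # 7 of 14 columns remain
overlap_32 = patch_overlap(shift, 224, 224, patch=32)  # 3 of 7 columns remain
```

Under these assumptions the same image pair scores 0.5 at patch size 16 but 3/7 (about 0.43) at patch size 32, so a pair selected with one grid can fall on the other side of an overlap threshold when measured with another.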
Taxonomy
Topics: Advanced Vision and Imaging · 3D Surveying and Cultural Heritage · Cell Image Analysis Techniques
