Pre-training with Random Orthogonal Projection Image Modeling
Maryam Haghighat, Peyman Moghadam, Shaheer Mohamed, Piotr Koniusz

TL;DR
This paper introduces ROPIM, a self-supervised image pre-training method that replaces the binary masking of masked image modeling with random orthogonal projections. The authors report that it outperforms crop-based masking and reaches state-of-the-art results on multiple benchmarks.
Contribution
Proposes ROPIM, a new masking approach based on orthogonal projections, improving over crop-based masking in masked image modeling.
Findings
ROPIM outperforms crop-based masking in experiments.
Achieves state-of-the-art results on multiple benchmarks.
Reduces spatial token information under a guaranteed bound on the noise variance.
Abstract
Masked Image Modeling (MIM) is a powerful self-supervised strategy for visual pre-training without the use of labels. MIM applies random crops to input images, processes them with an encoder, and then recovers the masked inputs with a decoder, which encourages the network to capture and learn structural information about objects and scenes. The intermediate feature representations obtained from MIM are suitable for fine-tuning on downstream tasks. In this paper, we propose an Image Modeling framework based on random orthogonal projection instead of binary masking as in MIM. Our proposed Random Orthogonal Projection Image Modeling (ROPIM) reduces spatial token information under a guaranteed bound on the noise variance and can be considered as masking the entire spatial image area under locally varying masking degrees. Since ROPIM uses a random subspace for the projection that realizes…
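The contrast the abstract draws between binary masking and random orthogonal projection can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how the random subspace is drawn, so the Gaussian-plus-QR construction below is an assumption, and all function and variable names are hypothetical. The sketch only shows that a binary mask is itself an orthogonal projector with a 0/1 diagonal, while a random projector mixes tokens and has diagonal entries between 0 and 1, which is one way to read "locally varying masking degrees".

```python
import numpy as np


def random_orthogonal_projector(num_tokens, subspace_dim, rng):
    """Projector onto a random subspace of token space, plus its complement.

    Assumption: the subspace is drawn by orthonormalising a Gaussian matrix
    with QR. The paper's own construction may differ.
    """
    gaussian = rng.standard_normal((num_tokens, subspace_dim))
    basis, _ = np.linalg.qr(gaussian)  # orthonormal columns, (num_tokens, subspace_dim)
    proj = basis @ basis.T             # symmetric and idempotent
    return proj, np.eye(num_tokens) - proj


def binary_mask_projector(num_tokens, keep_idx):
    """Binary masking written as a projector: a diagonal 0/1 matrix."""
    keep = np.zeros(num_tokens)
    keep[keep_idx] = 1.0
    return np.diag(keep)


rng = np.random.default_rng(0)
num_tokens, embed_dim, subspace_dim = 16, 8, 6  # illustrative sizes only

# Rows are patch tokens, columns are embedding channels.
tokens = rng.standard_normal((num_tokens, embed_dim))

proj, proj_perp = random_orthogonal_projector(num_tokens, subspace_dim, rng)
visible = proj @ tokens       # information kept along the token (spatial) axis
removed = proj_perp @ tokens  # information lying in the complement subspace

# Binary masking that keeps the same number of tokens, for comparison.
keep_idx = rng.choice(num_tokens, size=subspace_dim, replace=False)
mask_proj = binary_mask_projector(num_tokens, keep_idx)

# Per-token "masking degree": exactly 0 or 1 for a binary mask,
# anywhere in [0, 1] for the random projector.
random_degrees = np.diag(proj)
binary_degrees = np.diag(mask_proj)
```

In this sketch `visible + removed` recovers `tokens` exactly, so the complement carries precisely what the projection discards. How ROPIM uses the two parts in its encoder and loss is not described in the text above and is left out here.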
Peer Reviews
Decision: ICLR 2024 spotlight
+ The use of a linear-algebraic projection technique in both the method and the reconstruction loss.
+ Better results are achieved with a considerably smaller number of epochs.
+ The work is well motivated from the basics and seems reproducible.
+ The transfer learning results add value to the work.
+ The work should be useful as a pre-trainer for several ViT-based applications.
- I'm not sure whether all recent works on MIM have been compared against. The authors are requested to comment on this.
Though simple and straightforward, to my knowledge the proposed random orthogonal projection modeling for self-supervised learning is novel. The experimental results provided demonstrate better performance for the proposed random orthogonal projection method than for masked image modeling with crop-based masking.
The proposed method is somewhat heuristic.
1. This work is based on the sound theory of random orthogonal projection.
2. ROPIM achieves superior performance in a shorter pre-training time.
3. The decoder contains only one linear layer, making it lighter than that of MIM.
4. The experiments verify its effectiveness on several downstream tasks, including classification and segmentation.
1. It is a little difficult to understand why the proposed method is superior to MIM. According to my understanding, the ROP strategy randomly discards some local (not global) patterns during corruption, as shown in Fig. 9. This is very similar to MIM. So I can't intuitively grasp what gives ROP its advantage in the MIM setting. It would be better to have a discussion of this in the paper.
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Processing Techniques and Applications · Advanced Vision and Imaging
Methods: Mutual Information Machine/Mask Image Modeling
