TL;DR
This paper combines a shallow convolutional tokenizer with a latent-space predictive architecture to enable effective self-supervised vision representation learning on small datasets, reducing reliance on large-scale data and compute.
Contribution
The authors propose SCOTT and MIM-JEPA, enabling vision transformers to learn from limited data without extensive pretraining on large datasets.
Findings
Outperforms fully supervised methods on small datasets
Achieves competitive results with large-scale pretraining methods
Enables training of effective models with limited data and compute
Abstract
The reliance on large-scale datasets and extensive computational resources has become a major barrier to advancing representation learning in vision, especially in data-scarce domains. In this paper, we address the critical question: Can we escape the big data paradigm in self-supervised representation learning from images? We introduce SCOTT (Sparse Convolutional Tokenizer for Transformers), a shallow tokenization architecture that is compatible with Masked Image Modeling (MIM) tasks. SCOTT injects convolutional inductive biases into Vision Transformers (ViTs), enhancing their efficacy in small-scale data regimes. Alongside, we propose to train on a Joint-Embedding Predictive Architecture within a MIM framework (MIM-JEPA), operating in latent representation space to capture more semantic features. Our approach enables ViTs to be trained from scratch on datasets orders of magnitude…
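To make the idea of a masked-image-modeling objective computed in latent space concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the 50% mask ratio, the toy latents, and the use of mean squared error are all assumptions chosen for illustration, since this summary does not specify the encoders, masking strategy, or loss used in the paper.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Return sorted indices of patches hidden from the context encoder.

    Hypothetical helper: uniform random masking is an assumption.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))


def latent_mse_on_masked(predicted, target, masked_idx):
    """Mean squared error between predicted and target latents.

    Only masked patches contribute, so the comparison happens in
    representation space rather than in pixel space.
    MSE is an assumed stand-in for whatever loss the paper uses.
    """
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


# Toy data: 16 patches with 4-dimensional latents (made-up values).
num_patches, dim = 16, 4
target = [[float(i + d) for d in range(dim)] for i in range(num_patches)]
# Pretend the predictor is off by 0.5 in every latent dimension.
predicted = [[v + 0.5 for v in row] for row in target]

masked = random_patch_mask(num_patches, mask_ratio=0.5)
loss = latent_mse_on_masked(predicted, target, masked)
print(len(masked), loss)
```

In a real system the target latents would come from a target encoder that sees the full image, and the predicted latents from a predictor fed only the visible patches. Both are replaced by fixed lists here to keep the sketch self-contained.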
Peer Reviews
Decision·Submitted to ICLR 2025
1. This paper addresses an important problem: enabling model training on small-scale, unlabeled datasets, which is critical for advancing self-supervised learning in data-limited settings. 2. The authors conduct extensive experiments using multiple datasets.
1. The contributions of the proposed methods appear incremental compared to previous work. 2. The evaluation and comparisons with baseline and prior methods seem unfair due to differences in training setups. 3. The writing quality could be improved for clarity and readability. Please see my comments below for further details.
1. This paper is easy to follow and cleanly organized. 2. The topic is promising, since training Transformer-based vision models is very data-hungry.
1. The comparison experiments are weak: there are many conv+ViT baselines, yet the paper compares against only a few, and the related work section misses many relevant references. The experiments are therefore not convincing. 2. The motivation is clear, but the paper lacks an analysis of related work. What problems do similar designs have? Why is the proposed method better? Why choose this design (e.g., MIM-JEPA)? The overall elaboration is not quite
1. The integration of convolutional biases through SCOTT and the focus on semantic feature extraction via MIM-JEPA can shift away from the reliance on extensive pre-training datasets. 2. The proposed methods outperform fully supervised methods and achieve results competitive with state-of-the-art models pre-trained on much larger datasets.
1. The authors claim that the datasets used are high-resolution; however, I believe these datasets should not be considered high-resolution. (Of course, compared to low-resolution CIFAR and MNIST, they are.) I suggest that the authors also include results from higher-resolution, domain-specific datasets, as well as from low-resolution datasets, to provide a more comprehensive analysis of performance variations across different resolutions. 2. The methodology appears to be primarily limited to
Taxonomy
Methods: Mutual Information Machine/Mask Image Modeling
