RECLIP: Resource-efficient CLIP by Training with Small Images
Runze Li, Dahun Kim, Bir Bhanu, Weicheng Kuo

TL;DR
RECLIP introduces a resource-efficient training method for CLIP that uses small images for initial learning and fine-tunes with high-resolution data, significantly reducing computational costs while maintaining competitive performance.
Contribution
The paper proposes RECLIP, a novel approach that drastically reduces training resources for CLIP by leveraging small images and a coarse-to-fine training strategy, enabling resource-efficient large-scale pretraining.
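The coarse-to-fine strategy described above can be sketched as a resolution schedule: most steps train on small images, and a short final phase switches to high resolution. This is a minimal illustration, not the authors' implementation. The function name, the 64 and 224 pixel sizes, and the 10% finetuning fraction are hypothetical values chosen for the example and are not stated in this summary.

```python
def resolution_for_step(step, total_steps, low_res=64, high_res=224,
                        finetune_fraction=0.1):
    """Return the image side length (in pixels) to use at a training step.

    All defaults are illustrative assumptions, not values from the paper.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    if not 0.0 <= finetune_fraction <= 1.0:
        raise ValueError("finetune_fraction must be in [0, 1]")
    # Number of steps reserved for the final high-resolution phase.
    finetune_steps = round(total_steps * finetune_fraction)
    finetune_start = total_steps - finetune_steps
    # Coarse phase first, fine phase at the end.
    return low_res if step < finetune_start else high_res


if __name__ == "__main__":
    total = 1000
    for s in (0, 500, 899, 900, 999):
        print(s, resolution_for_step(s, total))
```

In a real training loop the returned size would drive the image resizing step of the data pipeline, so the model and the loss stay unchanged between the two phases.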
Findings
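Uses 6 to 8x less computational resources and 7 to 9x fewer FLOPs than the baseline at the same batch size and number of training epochs, with highly competitive zero-shot classification and image-text retrieval accuracy.
Demonstrates 5 to 59x training resource savings over state-of-the-art contrastive learning methods.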
Matches the state of the art in open-vocabulary detection, reaching 32 APr (AP on rare categories) on LVIS.
Abstract
We present RECLIP (Resource-efficient CLIP), a simple method that minimizes computational resource footprint for CLIP (Contrastive Language Image Pretraining). Inspired by the notion of coarse-to-fine in computer vision, we leverage small images to learn from large-scale language supervision efficiently, and finetune the model with high-resolution data in the end. Since the complexity of the vision transformer heavily depends on input image size, our approach significantly reduces the training resource requirements both in theory and in practice. Using the same batch size and training epoch, RECLIP achieves highly competitive zero-shot classification and image-text retrieval accuracy with 6 to 8x less computational resources and 7 to 9x fewer FLOPs than the baseline. Compared to the state-of-the-art contrastive learning methods, RECLIP demonstrates 5 to 59x training resource savings…
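The abstract's point that vision transformer cost depends heavily on input image size can be made concrete with a rough count of patch tokens and per-layer operations. The sketch below is a back-of-the-envelope estimate under stated assumptions: a patch size of 16, a hidden width of 768, an MLP expansion factor of 4, and image sizes of 64 and 224 pixels are all illustrative choices, not figures taken from this summary. It covers one image-encoder layer only, so its ratio is not expected to match the end-to-end savings reported above, which also include the text encoder and the high-resolution finetuning phase.

```python
def num_patch_tokens(image_size, patch_size=16):
    """Number of patch tokens for a square image (class token ignored)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


def encoder_layer_cost(num_tokens, width):
    """Approximate multiply-accumulates for one transformer encoder layer.

    Assumes a standard layer: Q, K, V and output projections (4 * N * d^2),
    an MLP with 4x expansion (8 * N * d^2), and the two attention matrix
    products QK^T and attention @ V (2 * N^2 * d).
    """
    linear_terms = 12 * num_tokens * width ** 2
    attention_terms = 2 * num_tokens ** 2 * width
    return linear_terms + attention_terms


if __name__ == "__main__":
    width = 768  # illustrative hidden size
    small = num_patch_tokens(64)
    large = num_patch_tokens(224)
    ratio = encoder_layer_cost(large, width) / encoder_layer_cost(small, width)
    print(f"tokens at 64px: {small}, tokens at 224px: {large}")
    print(f"per-layer cost ratio (224px / 64px): {ratio:.2f}")
```

The token count grows with the square of the image side length, the linear terms grow in proportion to the token count, and the attention terms grow with its square, which is why shrinking the input is such an effective lever on training cost.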
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
Methods: Attention Is All You Need · Softmax · Linear Layer · Layer Normalization · Residual Connection · Dense Connections · Contrastive Learning · Multi-Head Attention · Contrastive Language-Image Pre-training · Vision Transformer
