EBMs vs. CL: Exploring Self-Supervised Visual Pretraining for Visual Question Answering
Violetta Shevchenko, Ehsan Abbasnejad, Anthony Dick, Anton van den Hengel, Damien Teney

TL;DR
This paper compares energy-based models and contrastive learning for self-supervised visual pretraining in visual question answering, finding contrastive learning generally more stable and effective.
Contribution
It provides a systematic evaluation of EBMs versus CL for pretraining visual representations for VQA, highlighting CL's advantages.
Findings
Both EBMs and CL enable learning from unlabeled images for VQA.
CL representations improve systematic generalization and match larger supervised models.
EBMs face training instabilities and are less effective for downstream tasks.
Abstract
The availability of clean and diverse labeled data is a major roadblock for training models on complex tasks such as visual question answering (VQA). The extensive work on large vision-and-language models has shown that self-supervised learning is effective for pretraining multimodal interactions. In this technical report, we focus on visual representations. We review and evaluate self-supervised methods to leverage unlabeled images and pretrain a model, which we then fine-tune on a custom VQA task that allows controlled evaluation and diagnosis. We compare energy-based models (EBMs) with contrastive learning (CL). While EBMs are growing in popularity, they lack an evaluation on downstream tasks. We find that both EBMs and CL can learn representations from unlabeled images that enable training a VQA model on very little annotated data. In a simple setting similar to CLEVR, we find that…
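To make the contrastive learning side of the comparison concrete, here is a minimal sketch of a generic InfoNCE-style loss over paired image embeddings. It is an illustration only: the abstract does not say which contrastive objective the report uses, and the function name `info_nce_loss`, the `temperature` value, and the random stand-in embeddings are all assumptions introduced for this example.

```python
import numpy as np


def info_nce_loss(z_a, z_b, temperature=0.1):
    """Generic InfoNCE-style loss (illustrative, not the paper's exact objective).

    z_a, z_b: arrays of shape (N, D). Row i of z_a and row i of z_b are
    embeddings of two views of the same image (a positive pair); every
    other row in the batch acts as a negative.
    """
    # L2-normalize so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature (assumed value).
    logits = (z_a @ z_b.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positive pairs sit on the diagonal; average their negative log-probability.
    return float(-np.mean(np.diag(log_probs)))


# Stand-in embeddings (random, hypothetical): a batch of 8 vectors of size 16.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z_view = z + 0.01 * rng.normal(size=(8, 16))  # slightly perturbed second view

loss_aligned = info_nce_loss(z, z_view)      # positives correctly paired
loss_shuffled = info_nce_loss(z, z[::-1])    # positives deliberately mismatched
```

In this toy setup the loss is lower when each embedding is paired with its own perturbed view than when the pairing is scrambled, which is the signal a contrastive pretraining objective of this kind optimizes.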
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Learning
