Does Visual Pretraining Help End-to-End Reasoning?
Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid

TL;DR
This paper investigates whether visual pretraining lets general-purpose neural networks learn visual reasoning end to end, without explicit object detection. It reports that self-supervised pretraining significantly improves compositional generalization on visual reasoning tasks.
Contribution
It introduces a simple self-supervised framework that compresses video frames into tokens and captures temporal dynamics, demonstrating the importance of pretraining for visual reasoning.
Findings
Pretraining is crucial for compositional generalization in visual reasoning.
The proposed self-supervised method outperforms traditional supervised pretraining.
Explicit object detection is not necessary for effective visual reasoning.
Abstract
We aim to investigate whether end-to-end learning of visual reasoning can be achieved with general-purpose neural networks, with the help of visual pretraining. A positive result would refute the common belief that explicit visual abstraction (e.g. object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network "generalist" to solve visual recognition and reasoning tasks. We propose a simple and general self-supervised framework which "compresses" each video frame into a small set of tokens with a transformer network, and reconstructs the remaining frames based on the compressed temporal context. To minimize the reconstruction loss, the network must learn a compact representation for each image, as well as capture temporal dynamics and object permanence from temporal context. We perform evaluation on two visual…
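The compress-then-reconstruct objective described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: a single cross-attention pooling step stands in for the transformer encoder, a mean over context tokens followed by a linear map stands in for the decoder, and the names `compress_frame`, `reconstruction_loss`, `queries`, `decoder`, and `context_idx` are hypothetical.

```python
import numpy as np

def compress_frame(patches, queries):
    """Pool P patch embeddings (P, D) into K tokens (K, D).

    Assumption: one cross-attention step with K hypothetical learned
    queries stands in for the paper's transformer encoder.
    """
    d = patches.shape[-1]
    scores = queries @ patches.T / np.sqrt(d)             # (K, P)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ patches                              # (K, D)

def reconstruction_loss(video, queries, decoder, context_idx):
    """Mean squared error on the frames that are NOT in the context.

    video: (T, P, D) patch embeddings for T frames.
    Assumption: the decoder is a single linear map from an averaged
    context vector to a full frame; a real model would be far richer.
    """
    context_set = set(context_idx)
    target_idx = [t for t in range(video.shape[0]) if t not in context_set]
    # Compress every context frame into K tokens: (C, K, D).
    context = np.stack([compress_frame(video[t], queries) for t in context_idx])
    summary = context.mean(axis=(0, 1))                   # (D,) crude temporal context
    prediction = (summary @ decoder).reshape(video.shape[1:])  # (P, D)
    errors = [(video[t] - prediction) ** 2 for t in target_idx]
    return float(np.mean(errors))

# Toy data: 6 frames, 16 patches per frame, 8-dim embeddings, 4 tokens per frame.
rng = np.random.default_rng(0)
T, P, D, K = 6, 16, 8, 4
video = rng.normal(size=(T, P, D))
queries = rng.normal(size=(K, D))
decoder = rng.normal(size=(D, P * D)) * 0.1

tokens = compress_frame(video[0], queries)
loss = reconstruction_loss(video, queries, decoder, context_idx=[0, 2, 4])
```

Because each frame must pass through only `K` tokens before the held-out frames are predicted, minimizing this kind of loss pressures the encoder toward a compact per-frame representation, which is the intuition the abstract gives for the framework.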
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Vision and Imaging · Domain Adaptation and Few-Shot Learning
