Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?
Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Yecheng Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Pieter Abbeel, Jitendra Malik, Dhruv Batra, Yixin Lin, Oleksandr Maksymets, Aravind Rajeswaran, and Franziska Meier

TL;DR
This empirical study evaluates pre-trained visual representations (PVRs) for Embodied AI across 17 tasks. It finds that scaling pre-training data size and diversity improves performance on average but not on every task, and that task-specific adaptation significantly improves results.
Contribution
The paper provides the largest systematic evaluation of visual foundation models for Embodied AI, including a new benchmark, CortexBench, and insights into dataset size, diversity, and adaptation effects.
Findings
Scaling pre-training dataset size and diversity improves performance on average, but not universally across tasks.
Task-specific adaptation of VC-1 yields significant gains.
VC-1 outperforms prior models on average and in real-world tests.
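The findings above rest on a distinction between a model that is best on average and one that is universally dominant (best on every task). A minimal sketch of that distinction follows. The model names, task names, and scores are hypothetical placeholders, not results from the paper, and the helper functions are illustrative rather than part of CortexBench.

```python
from statistics import mean

# Hypothetical placeholder scores (NOT results from the paper):
# model name -> task name -> success rate in [0, 1].
scores = {
    "model_a": {"task_1": 0.80, "task_2": 0.60, "task_3": 0.70},
    "model_b": {"task_1": 0.70, "task_2": 0.75, "task_3": 0.50},
}


def best_on_average(scores):
    """Return the model with the highest mean score across tasks."""
    return max(scores, key=lambda m: mean(scores[m].values()))


def universally_dominant(scores):
    """Return a model that matches or beats every other model on every task,
    or None if no such model exists."""
    tasks = next(iter(scores.values())).keys()
    # Best score achieved by any model on each task.
    best_per_task = {t: max(s[t] for s in scores.values()) for t in tasks}
    for model, per_task in scores.items():
        if all(per_task[t] >= best_per_task[t] for t in tasks):
            return model
    return None


print(best_on_average(scores))       # model with the highest mean
print(universally_dominant(scores))  # None when no model wins every task
```

With these placeholder numbers, one model has the higher mean yet loses on one task, so an average winner exists while no model is universally dominant. This is the pattern the study reports for existing PVRs.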
Abstract
We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual 'foundation models' for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of pre-training data size and diversity, we combine over 4,000 hours of egocentric videos from 7 different sources (over 4.3M images) and ImageNet to train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
