The Curious Robot: Learning Visual Representations via Physical Interactions
Lerrel Pinto, Dhiraj Gandhi, Yuanfeng Han, Yong-Lae Park, Abhinav Gupta

TL;DR
This paper introduces a robotic system that learns visual representations through physical interactions with objects, demonstrating that active manipulation provides effective supervision for visual learning, outperforming passive observation methods.
Contribution
The study presents one of the first systems where a robot learns visual features via physical interactions, showing improvements over traditional passive learning approaches.
Findings
The robot collected more than 130K interaction-based datapoints.
Learned representations outperform passive methods in image classification.
Network achieves 3% higher recall@1 than ImageNet-trained models.
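For readers unfamiliar with the metric, recall@1 in a retrieval setting is the fraction of queries whose single nearest neighbor in feature space shares the query's label. The sketch below is illustrative only: the function name `recall_at_1`, the use of cosine similarity, and the leave-one-out protocol are assumptions, not details taken from the paper, whose exact evaluation setup is not given in this summary.

```python
import numpy as np

def recall_at_1(features, labels):
    """Leave-one-out recall@1 (hypothetical helper, not the paper's code).

    Each item is used as a query against all other items; a query counts
    as a hit when its most similar neighbor has the same label.
    """
    feats = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Assumption: cosine similarity, so normalize each feature vector.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = feats @ feats.T
    # Exclude each query from its own neighbor list.
    np.fill_diagonal(sims, -np.inf)
    nearest = sims.argmax(axis=1)
    return float(np.mean(labels[nearest] == labels))

# Toy example with made-up 2-D features and labels.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.6]]
toy_labels = [0, 0, 1, 1, 1]
print(recall_at_1(toy_features, toy_labels))
```

Under this definition, a 3% gain in recall@1 would mean that three more queries out of every hundred retrieve a correct top match.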
Abstract
What is the right supervisory signal to train visual representations? Current approaches in computer vision use category labels from datasets such as ImageNet to train ConvNets. However, in case of biological agents, visual representation learning does not require millions of semantic labels. We argue that biological agents use physical interactions with the world to learn visual representations unlike current vision systems which just use passive observations (images and videos downloaded from web). For example, babies push objects, poke them, put them in their mouth and throw them to learn representations. Towards this goal, we build one of the first systems on a Baxter platform that pushes, pokes, grasps and observes objects in a tabletop environment. It uses four different types of physical interactions to collect more than 130K datapoints, with each datapoint providing supervision…
