A Computational Account Of Self-Supervised Visual Learning From Egocentric Object Play
Deepayan Sanyal, Joel Michelson, Yuan Yang, James Ainooson, and Maithilee Kunda

TL;DR
Inspired by findings in child development, this paper explores how self-supervised contrastive learning on egocentric videos of object manipulation can improve visual representations and downstream classification accuracy.
Contribution
It introduces a method leveraging physical viewpoint variations in egocentric videos to enhance self-supervised visual learning, demonstrating improved downstream classification performance.
Findings
Representations learned by equating different physical viewpoints of an object improve downstream image classification accuracy.
Performance gains are robust to viewpoint gap variations.
Benefits transfer across multiple image classification tasks.
Abstract
Research in child development has shown that embodied experience handling physical objects contributes to many cognitive abilities, including visual learning. One characteristic of such experience is that the learner sees the same object from several different viewpoints. In this paper, we study how learning signals that equate different viewpoints -- e.g., assigning similar representations to different views of a single object -- can support robust visual learning. We use the Toybox dataset, which contains egocentric videos of humans manipulating different objects, and conduct experiments using a computer vision framework for self-supervised contrastive learning. We find that representations learned by equating different physical viewpoints of an object benefit downstream image classification accuracy. Further experiments show that this performance improvement is robust to variations…
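The idea of "equating" viewpoints can be illustrated with a contrastive objective in which two frames of the same object, taken from one manipulation video, form a positive pair. The sketch below is an illustration under stated assumptions, not the authors' implementation: the abstract does not name the loss, so a SimCLR-style NT-Xent loss is assumed, and the names `sample_viewpoint_pair`, `nt_xent_loss`, `max_gap`, and `temperature` are hypothetical. Random vectors stand in for the output of an image encoder.

```python
import numpy as np


def sample_viewpoint_pair(num_frames, max_gap, rng):
    """Pick two frame indices from one video, at most `max_gap` frames apart.

    Hypothetical helper: the two frames act as different physical
    viewpoints of the same object.
    """
    i = int(rng.integers(0, num_frames))
    lo = max(0, i - max_gap)
    hi = min(num_frames - 1, i + max_gap)
    j = int(rng.integers(lo, hi + 1))  # upper bound is exclusive
    return i, j


def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss (assumed here) over N positive pairs (z_a[i], z_b[i]).

    z_a, z_b: arrays of shape (N, d) holding embeddings of two views
    of the same N objects.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own negative

    # Row i (first view) is paired with row i + n (second view), and vice versa.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob_pos = sim[np.arange(2 * n), pos] - log_denom
    return float(-log_prob_pos.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    i, j = sample_viewpoint_pair(num_frames=100, max_gap=10, rng=rng)
    print("frame pair:", i, j)

    # Stand-in embeddings: view B is a slightly perturbed copy of view A.
    z_a = rng.normal(size=(8, 16))
    z_b = z_a + 0.01 * rng.normal(size=z_a.shape)
    print("loss:", nt_xent_loss(z_a, z_b))
```

The loss is low when both views of an object map to nearby embeddings and high when they do not, which is the sense in which the learning signal equates viewpoints. Varying `max_gap` is one way to probe how far apart the paired viewpoints are.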
Taxonomy
Topics: Human Pose and Action Recognition
