Self-supervised visual learning from interactions with objects
Arthur Aubret, Céline Teulière, Jochen Triesch

TL;DR
This paper proposes a method to enhance self-supervised visual learning by incorporating object-related actions, leading to improved object category recognition through better viewpoint alignment.
Contribution
It introduces a novel loss function that aligns action and visual embeddings, leveraging embodied interactions to structure visual representations in SSL.
Findings
Outperforms previous SSL methods on category recognition tasks.
Improves viewpoint-wise alignment of objects within the same category.
Embodied actions contribute to more robust visual representations.
Abstract
Self-supervised learning (SSL) has revolutionized visual representation learning, but has not achieved the robustness of human vision. A reason for this could be that SSL does not leverage all the data available to humans during learning. When learning about an object, humans often purposefully turn it or move around it, and research suggests that these interactions can substantially enhance their learning. Here we explore whether such object-related actions can boost SSL. For this, we extract the actions performed to change from one ego-centric view of an object to another in four video datasets. We then introduce a new loss function to learn visual and action embeddings by aligning the performed action with the representations of two images extracted from the same clip. This permits the performed actions to structure the latent visual representation. Our experiments show that our…
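Illustrative sketch (not from the paper)
The abstract does not give the form of the loss, so the following is only a minimal sketch of what aligning an action with a pair of views could look like, not the authors' method. It assumes an InfoNCE-style contrastive objective, a pair feature defined as the difference of the two image embeddings, cosine similarity, and a temperature of 0.1. The function name `action_alignment_loss` and all of its parameters are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def action_alignment_loss(z_first, z_second, action_emb, temperature=0.1):
    """Hypothetical InfoNCE-style loss (an assumption, not the paper's loss).

    z_first, z_second: embeddings of two views taken from the same clip.
    action_emb: embedding of the action that led from the first view to the second.
    Each pair's embedding change should match its own action embedding
    better than the action embeddings of the other pairs in the batch.
    """
    # Assumed pair feature: the change between the two view embeddings.
    changes = [
        [b - a for a, b in zip(z1, z2)] for z1, z2 in zip(z_first, z_second)
    ]
    total = 0.0
    for i, change in enumerate(changes):
        logits = [cosine(change, act) / temperature for act in action_emb]
        # Numerically stable log-sum-exp over all candidate actions.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the pair's own action as the positive.
        total += log_norm - logits[i]
    return total / len(changes)


# Toy batch of two pairs with 2-D embeddings.
z_first = [[0.0, 0.0], [0.0, 0.0]]
z_second = [[1.0, 0.0], [0.0, 1.0]]
matching_actions = [[1.0, 0.0], [0.0, 1.0]]
swapped_actions = [[0.0, 1.0], [1.0, 0.0]]

loss_matching = action_alignment_loss(z_first, z_second, matching_actions)
loss_swapped = action_alignment_loss(z_first, z_second, swapped_actions)
```

In this toy batch the loss is lower when each action embedding points in the same direction as the change between its two views than when the actions are swapped across pairs, which is the sense in which the actions could structure the visual latent space.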
Taxonomy
Topics: Image Retrieval and Classification Techniques
