Human Gaze Boosts Object-Centered Representation Learning
Timothy Schaumlöffel, Arthur Aubret, Gemma Roig, Jochen Triesch

TL;DR
This study shows that emphasizing the central visual information around gaze points in egocentric videos improves object-centered representation learning. The approach is inspired by human visual processing and also exploits gaze dynamics to learn stronger visual features.
Contribution
The paper introduces a gaze-centered cropping approach, inspired by human vision, for self-supervised learning (SSL) models trained on egocentric videos, and demonstrates that it improves object-centered representations.
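The cropping idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gaze_centered_crop`, the square window, the pixel-coordinate gaze format, and the choice to shift the window so it stays inside the frame are all assumptions made for illustration.

```python
import numpy as np


def gaze_centered_crop(frame, gaze_xy, crop_size):
    """Return a square crop of `frame` centered on a gaze point.

    frame     : array of shape (H, W, C)
    gaze_xy   : (x, y) gaze location in pixel coordinates (assumed format)
    crop_size : side length of the crop in pixels (hypothetical parameter)
    """
    height, width = frame.shape[:2]
    if crop_size > min(height, width):
        raise ValueError("crop_size must fit inside the frame")

    x, y = gaze_xy
    half = crop_size // 2

    # Top-left corner of the window. Clamping shifts the window back inside
    # the frame when the gaze is near a border (an assumption, not
    # necessarily what the paper does).
    left = int(np.clip(round(x) - half, 0, width - crop_size))
    top = int(np.clip(round(y) - half, 0, height - crop_size))

    return frame[top:top + crop_size, left:left + crop_size]


# Usage: crop a 4x4 patch around a gaze point in a dummy 10x12 RGB frame.
dummy_frame = np.arange(10 * 12 * 3).reshape(10, 12, 3)
patch = gaze_centered_crop(dummy_frame, gaze_xy=(6, 5), crop_size=4)
```

In a training pipeline, such a crop would typically be applied per frame using the predicted gaze location before the usual SSL augmentations; how the crop size is chosen is left open here.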
Findings
Focusing on gaze-centered regions enhances object representation quality.
Temporal gaze dynamics contribute to stronger visual features.
SSL training on gaze-based crops outperforms training on raw, uniform visual inputs.
Abstract
Recent self-supervised learning (SSL) models trained on human-like egocentric visual inputs substantially underperform on image recognition tasks compared to humans. These models train on raw, uniform visual inputs collected from head-mounted cameras. This is different from humans, as the anatomical structure of the retina and visual cortex relatively amplifies the central visual information, i.e., around humans' gaze location. This selective amplification in humans likely aids in forming object-centered visual representations. Here, we investigate whether focusing on central visual information boosts egocentric visual object learning. We simulate 5 months of egocentric visual experience using the large-scale Ego4D dataset and generate gaze locations with a human gaze prediction model. To account for the importance of central vision in humans, we crop the visual area around the gaze…
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Face Recognition and Perception
