Temporal Slowness in Central Vision Drives Semantic Object Learning
Timothy Schaumlöffel, Arthur Aubret, Gemma Roig, Jochen Triesch

TL;DR
This paper demonstrates that leveraging temporal slowness in central vision enhances the learning of semantic object representations from naturalistic egocentric visual streams, providing insights into human visual learning mechanisms.
Contribution
It introduces a novel approach combining central vision focus and temporal slowness in self-supervised learning to improve semantic object encoding from human-like visual experience.
Findings
Temporal slowness improves object semantic encoding.
Focusing on central vision enhances foreground feature extraction.
Eye movements combined with slowness encode broader semantic information.
Abstract
Humans acquire semantic object representations from egocentric visual streams with minimal supervision, but the underlying mechanisms remain unclear. Importantly, the visual system only processes the center of its field of view with high resolution and it learns similar representations for visual inputs occurring close in time. This emphasizes slowly changing information around gaze locations. This study investigates the role of central vision and slowness learning in the formation of semantic object representations from human-like visual experience. We simulate five months of human-like visual experience using the Ego4D dataset and a state-of-the-art gaze prediction model. We extract image crops around predicted gaze locations to train a time-contrastive Self-Supervised Learning model. Our results show that exploiting temporal slowness when learning from central visual field experience…
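The pipeline described in the abstract (crop around a gaze location, then pull together the representations of crops that occur close in time) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names, the border handling (the crop is shifted to stay inside the frame), the uniform sampling of temporal offsets, and the InfoNCE-style loss are all assumptions, since the abstract only states that a time-contrastive self-supervised model is trained on gaze-centered crops.

```python
import numpy as np


def gaze_crop(frame, gaze_xy, size):
    """Return a size x size crop centered on gaze_xy.

    Assumption: near the border the crop is shifted to stay inside the frame
    (the source does not say how borders are handled).
    """
    h, w = frame.shape[:2]
    if size > h or size > w:
        raise ValueError("crop size exceeds frame size")
    x, y = gaze_xy
    # Clamp the top-left corner so the crop never leaves the frame.
    left = int(np.clip(round(x) - size // 2, 0, w - size))
    top = int(np.clip(round(y) - size // 2, 0, h - size))
    return frame[top:top + size, left:left + size]


def sample_temporal_pairs(n_frames, max_dt, rng):
    """Pair each frame index with a partner at most max_dt frames away.

    Assumption: offsets are drawn uniformly; the paper's sampling scheme
    may differ.
    """
    anchors = np.arange(n_frames)
    offsets = rng.integers(-max_dt, max_dt + 1, size=n_frames)
    partners = np.clip(anchors + offsets, 0, n_frames - 1)
    return anchors, partners


def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss: row i of z_a should match row i of z_b.

    Assumption: a generic contrastive objective standing in for the paper's
    time-contrastive loss, which the abstract does not specify.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; all other rows act as negatives.
    return float(-np.mean(np.diag(log_prob)))
```

In this sketch, a larger `max_dt` treats frames further apart in time as positives, which is the knob through which temporal slowness enters. Passing the frame center as `gaze_xy` yields a fixed center crop, a control some of the reviews below ask about.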
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper is well-written and easy to follow. 2. The set of experiments and analyses done to present evidence of the claims is quite meticulous and impressive. I particularly appreciate the analysis of CKA similarity between learned representations and GloVe-based object co-occurrence embeddings. 3. The performance metrics for some of the tasks, like fine-grained and instance-level recognition, are quite impressive.
1. Even though the paper focuses on central vision, simply ignoring peripheral vision might overlook many insights into the human visual system. 2. Since the gaze crop covers a large part of the scene (224/336 per side), and the gaze location is heavily biased to the center (Fig. 10 of the appendix), I am not sure if a gaze location is necessary to crop the frame. What would the results look like if the crop was a 224×224 or 336×336 box centered in the frame? 3. I am not convinced that a biologically
- The integration of central vision with temporal slowness is a focused, biologically motivated idea that appears to yield practical improvements on object-centric tasks, suggesting value for egocentric and embodied learning communities. - The empirical exploration includes reasonable sweeps over crop sizes and temporal pairing that reveal interpretable behavior and replicate across backbones. - The end-to-end pipeline is presented clearly with intuitive figures that explain how gaze-centered cr
- There is a heavy reliance on the predicted gaze rather than ground-truth gaze, and it is not accompanied by any calibration/error metrics. The paper could benefit from reporting prediction error characteristics and relating them to performance when ground-truth gaze is substituted for the eye-tracked subset. - It remains unclear whether the gains arise from human-like fixations or from general object-biased views. The paper could benefit from controls using fixed center crops, saliency-only cr
1) The paper aims to study the role of central vision and temporal slowness in the formation of semantic object representations in humans. The motivation is interesting. 2) Extensive experiments are conducted to evaluate the effectiveness of the proposed feature learning on four downstream vision recognition tasks.
1) **Lack of technical novelty**. The technical approach of the paper, namely the extraction of gaze-centered image crops with an existing gaze prediction model and self-supervised contrastive learning guided by temporal distance, consists of very basic technical processes. Thus the paper lacks sufficient technical contributions. 2) **Lack of sufficient performance comparison**. The paper didn't make sufficient comparison with SOTA feature learning methods to validate the superior pe
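The CKA analysis mentioned in the reviews above compares two sets of representations computed for the same items. As a rough illustration, the linear variant of centered kernel alignment can be written as below. Whether the paper uses linear or kernel CKA is not stated in this summary, so the choice of the linear form and the function name `linear_cka` are assumptions.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two representation matrices.

    x: (n_items, d_x) array, y: (n_items, d_y) array, with rows aligned so
    that row i in both matrices describes the same item.
    Assumption: linear CKA; the paper may use a different kernel.
    """
    # Center each feature dimension across items.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

The score lies between 0 and 1 and is unchanged by rotating or uniformly rescaling either representation, which is why it suits comparing learned visual features with word-embedding-based vectors of a different dimensionality.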
Taxonomy
Topics: Visual Attention and Saliency Detection · Gaze Tracking and Assistive Technology · Face Recognition and Perception
