Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding
Shravan Murlidaran, Ziqi Wen, Sana Shehabi, and Miguel P. Eckstein

TL;DR
This paper demonstrates that a foveated visual language model trained to optimize scene understanding naturally develops fixation patterns similar to humans, suggesting these patterns are linked to perceptual optimization.
Contribution
The study shows that human-like fixation patterns emerge in a computational model trained specifically for scene comprehension, highlighting their functional role.
Findings
A model with simulated foveation predicts human fixation patterns more accurately than variants trained to search or classify scenes.
Training for scene comprehension leads to human-like fixation patterns.
Agents whose peripheral vision is better or worse than human vision predict human fixation patterns less accurately.
Abstract
When humans view scenes without a specific task (free-viewing), they initially direct their eye movements toward the scene center and then fixate on people, text, objects being gazed at or grasped, and semantically meaningful regions. What these signature fixation patterns reflect and whether they optimize an underlying perceptual task remain unknown. We show that a computational agent with simulated foveation, trained to optimize scene comprehension, exhibits emergent human fixation signature patterns. In contrast, versions of the agent trained to search or classify scenes, or equipped with peripheral vision that was better or worse than human vision, predicted human fixation patterns less accurately. Thus, human free-viewing fixation patterns may emerge as a functional byproduct of optimizing scene comprehension under the biological constraints of foveated vision.
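To make "simulated foveation" concrete, here is a minimal sketch of one common way to approximate it: image detail is kept intact near the fixation point and progressively blurred with distance from it. This is not the authors' model. The function names (`box_blur`, `foveate`), the box-blur filter, the linear falloff, and the parameters `fovea_radius` and `max_blur` are all hypothetical choices made for illustration.

```python
import numpy as np


def box_blur(img, radius):
    """Separable box blur with edge padding; radius 0 returns an unblurred copy."""
    if radius == 0:
        return img.astype(float).copy()
    k = 2 * radius + 1
    kernel = np.ones(k) / k
    padded = np.pad(img.astype(float), radius, mode="edge")
    # Blur along rows first, then along columns.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded
    )
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows
    )


def foveate(img, fixation, fovea_radius=8.0, max_blur=4):
    """Blur a 2-D grayscale image more strongly the farther a pixel is from fixation.

    fixation is (row, col). The linear blur falloff is an assumption made
    for this sketch, not a fitted model of human peripheral vision.
    """
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Eccentricity: pixel distance from the fixation point.
    ecc = np.hypot(ys - fixation[0], xs - fixation[1])
    # Blur radius is 0 inside the fovea and grows with eccentricity, capped at max_blur.
    level = np.clip((ecc - fovea_radius) / fovea_radius, 0, max_blur)
    level = np.rint(level).astype(int)
    out = np.empty((h, w), dtype=float)
    for r in range(max_blur + 1):
        mask = level == r
        if mask.any():
            out[mask] = box_blur(img, r)[mask]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scene = rng.random((64, 64))
    view = foveate(scene, fixation=(16, 16))
    print(view.shape)
```

In this sketch, changing `fovea_radius` or `max_blur` would correspond loosely to giving the agent better or worse peripheral vision, the kind of manipulation the abstract describes.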
