Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding

Shravan Murlidaran, Ziqi Wen, Sana Shehabi, and Miguel P. Eckstein

arXiv:2605.17823 · cs.CV · May 19, 2026

TL;DR

A foveated visual language model trained to optimize scene understanding shows emergent human-like fixation patterns, suggesting that human free-viewing fixations may be a functional byproduct of optimizing scene comprehension under foveated vision.

Contribution

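The study shows that human-like fixation patterns emerge in a computational agent with simulated foveation that is trained to optimize scene comprehension, and less so in agents trained to search or classify scenes, pointing to a functional role for these patterns.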

Findings

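01. An agent with simulated foveation, trained to optimize scene comprehension, exhibits the signature patterns of human free-viewing fixations.

02. Versions of the agent trained to search or classify scenes predict human fixation patterns less accurately than the version trained for scene comprehension.

03. Versions of the agent with peripheral vision better or worse than human vision predict human fixation patterns less accurately.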

Abstract

When humans view scenes without a specific task (free-viewing), they initially direct their eye movements toward the scene center and then fixate on people, text, objects being gazed at or grasped, and semantically meaningful regions. What these signature fixation patterns reflect and whether they optimize an underlying perceptual task remain unknown. We show that a computational agent with simulated foveation, trained to optimize scene comprehension, exhibits emergent human fixation signature patterns. In contrast, versions of the agent trained to search or classify scenes, or equipped with peripheral vision that was better or worse than human vision, predicted human fixation patterns less accurately. Thus, human free-viewing fixation patterns may emerge as a functional byproduct of optimizing scene comprehension under the biological constraints of foveated vision.
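
The abstract does not say how foveation is simulated. As a rough illustration of the general idea (sharp detail at the fixation point, progressively coarser detail in the periphery), the sketch below blurs a grayscale image with a box average whose window grows with distance from fixation. The function name `foveate`, the parameters `fovea_radius` and `blur_per_pixel`, and the box-blur itself are assumptions made for this example; they are not the authors' method or values from the paper.

```python
import math


def foveate(image, fixation, fovea_radius=2.0, blur_per_pixel=0.25):
    """Return a copy of `image` blurred more strongly farther from `fixation`.

    image: list of rows of floats (grayscale intensities).
    fixation: (row, col) of the simulated gaze position.
    fovea_radius, blur_per_pixel: illustrative parameters (hypothetical,
    not taken from the paper) controlling where blur starts and how fast
    it grows with eccentricity.
    """
    height, width = len(image), len(image[0])
    fix_r, fix_c = fixation
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Eccentricity: distance in pixels from the fixation point.
            eccentricity = math.hypot(r - fix_r, c - fix_c)
            # Inside the fovea the radius is 0, so the pixel stays sharp;
            # outside, the averaging window grows linearly with eccentricity.
            radius = int(max(0.0, eccentricity - fovea_radius) * blur_per_pixel)
            total, count = 0.0, 0
            for rr in range(max(0, r - radius), min(height, r + radius + 1)):
                for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                    total += image[rr][cc]  # always read from the original image
                    count += 1
            out[r][c] = total / count
    return out


# A 16x16 checkerboard viewed with fixation at its center.
checkerboard = [[float((r + c) % 2) for c in range(16)] for r in range(16)]
foveated = foveate(checkerboard, fixation=(8, 8))
```

Moving `fixation` changes which part of the image keeps its fine detail, which is the constraint that makes the choice of where to look matter for an agent. Raising or lowering `blur_per_pixel` is one simple way to mimic peripheral vision that is worse or better than a chosen baseline, in the spirit of the comparison the abstract describes.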
