Predicting Reaction Time to Comprehend Scenes with Foveated Scene Understanding Maps
Ziqi Wen, Jonathan Skaza, Shravan Murlidaran, William Y. Wang, Miguel P. Eckstein

TL;DR
This paper introduces a novel foveated scene understanding map model that predicts human response times in scene comprehension tasks. By combining foveated vision with vision-language models, it outperforms standard image-based metrics such as clutter and visual complexity.
Contribution
The paper presents a new foveated vision model combined with vision-language models to predict human scene understanding times, emphasizing the role of foveated processing.
Findings
The Foveated Scene Understanding Map (F-SUM) score correlates with human response times (r=0.47) and saccades (r=0.51).
The F-SUM score predicts description accuracy (r=-0.56) in time-limited scenes.
The model outperforms standard image-based metrics such as clutter and visual complexity.
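The r values above are correlation coefficients. As a rough illustration of how such a figure is computed, here is a minimal sketch that assumes Pearson's r (the summary does not state which correlation measure the authors used). The function name `pearson_r` and the sample numbers are hypothetical and are not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-scene values, invented purely for illustration:
# one model score and one mean human response time (seconds) per scene.
scores = [0.2, 0.5, 0.1, 0.8, 0.6]
response_times = [1.1, 1.9, 1.4, 2.6, 1.7]

r = pearson_r(scores, response_times)
print(f"r = {r:.2f}")
```

A positive r means scenes with higher scores tend to have longer response times. A negative r, like the one reported for description accuracy, means higher scores go with lower accuracy.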
Abstract
Although models exist that predict human response times (RTs) in tasks such as target search and visual discrimination, the development of image-computable predictors for scene understanding time remains an open challenge. Recent advances in vision-language models (VLMs), which can generate scene descriptions for arbitrary images, combined with the availability of quantitative metrics for comparing linguistic descriptions, offer a new opportunity to model human scene understanding. We hypothesize that the primary bottleneck in human scene understanding and the driving source of variability in response times across scenes is the interaction between the foveated nature of the human visual system and the spatial distribution of task-relevant visual information within an image. Based on this assumption, we propose a novel image-computable model that integrates foveated vision with VLMs to…
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Neurobiology of Language and Bilingualism
