TL;DR
This paper evaluates how well vision language models (VLMs) can predict human visual attention on user interfaces (UIs), using eye-tracking data as ground truth. It finds moderate alignment that varies with UI type and viewing duration.
Contribution
It introduces UIGaze, a comprehensive study assessing VLMs' ability to approximate human gaze patterns on diverse UIs with real eye-tracking data.
Findings
VLMs achieve moderate correlation with human gaze patterns.
Alignment improves with longer viewing durations.
Performance varies significantly across different UI types.
Abstract
Vision Language Models (VLMs) have demonstrated strong capabilities in understanding visual content, yet their ability to predict where humans look on user interfaces remains unexplored. We present UIGaze, a study investigating how closely VLMs can approximate human visual attention on user interfaces using real eye-tracking data. Using the UEyes dataset, which comprises 1,980 UI screenshots across four categories (webpage, desktop, mobile, poster) with eye-tracking data from 62 participants, we evaluate nine state-of-the-art VLMs through a zero-shot coordinate prediction pipeline. Each model generates gaze point coordinates that are converted into saliency maps via Gaussian blurring and compared against ground truth using CC, SIM, and KL divergence. Our experiments (1,980 images × 9 models × 3 runs × 3 durations) reveal that VLMs achieve moderate alignment with human gaze patterns, with…
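The evaluation step described in the abstract (predicted gaze points, then a Gaussian-blurred saliency map, then CC, SIM, and KL divergence against ground truth) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the `sigma` value, the epsilon constant, and the KL direction (ground truth relative to prediction, as is common in saliency benchmarks) are all assumptions made for illustration.

```python
import numpy as np

EPS = 1e-12  # assumed constant; guards against division by zero and log(0)


def points_to_saliency(points, height, width, sigma=20.0):
    """Sum an isotropic Gaussian centred on each (x, y) gaze point.

    This is equivalent (up to scale and border handling) to Gaussian-blurring
    a map of point impulses. `sigma` is in pixels and is a hypothetical value.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    sal = np.zeros((height, width), dtype=np.float64)
    for px, py in points:
        sal += np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
    return sal


def to_distribution(sal):
    """Scale a non-negative map so that it sums to (approximately) one."""
    return sal / (sal.sum() + EPS)


def cc(pred, gt):
    """Pearson linear correlation coefficient between two maps."""
    p = (pred - pred.mean()) / (pred.std() + EPS)
    g = (gt - gt.mean()) / (gt.std() + EPS)
    return float((p * g).mean())


def sim(pred, gt):
    """Histogram intersection of the two maps viewed as distributions."""
    return float(np.minimum(to_distribution(pred), to_distribution(gt)).sum())


def kl_div(pred, gt):
    """KL divergence of the ground truth from the prediction (lower is better)."""
    p, g = to_distribution(pred), to_distribution(gt)
    return float((g * np.log(EPS + g / (p + EPS))).sum())


# Usage with made-up coordinates: a predicted map scored against a reference map.
predicted = points_to_saliency([(10, 20), (40, 30)], height=64, width=64, sigma=5.0)
reference = points_to_saliency([(12, 22), (38, 28)], height=64, width=64, sigma=5.0)
scores = {
    "CC": cc(predicted, reference),
    "SIM": sim(predicted, reference),
    "KL": kl_div(predicted, reference),
}
```

Higher CC and SIM indicate closer agreement, while lower KL does. Summing Gaussians directly keeps the sketch dependency-light; for full-resolution screenshots, a separable filter over an impulse map would be the more practical choice.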
