Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models

Rong Quan; Yantao Lai; Dong Liang; Jie Qin

arXiv:2604.20361·cs.CV·April 23, 2026

TL;DR

This paper introduces ScanVLA, a multimodal model that predicts human visual scanpaths for object search tasks by fusing vision-language features and incorporating historical fixation data.

Contribution

The novel ScanVLA model effectively combines vision-language fusion, historical fixation information, and segmentation guidance to improve scanpath prediction accuracy in object referring tasks.

Findings

01. ScanVLA significantly outperforms existing methods in object referring scanpath prediction.

02. Incorporating historical fixation data improves prediction accuracy.

03. Using a frozen Segmentation LoRA enhances object localization without high computational costs.

Abstract

Object Referring-guided Scanpath Prediction (ORSP) aims to predict the scanpath of human attention as a person searches a visual scene for a specific target object identified by a linguistic description. Multimodal information fusion is central to ORSP. We therefore propose a novel model, ScanVLA, which first uses a Vision-Language Model (VLM) to extract and fuse inherently aligned visual and linguistic feature representations from the input image and referring expression. Next, to enhance ScanVLA's perception of fine-grained positional information, we propose a novel History Enhanced Scanpath Decoder (HESD) that takes the position information of historical fixations directly as input to help predict a more reasonable position for the current fixation, and we adopt a frozen Segmentation LoRA as an auxiliary component to help localize the referred object…
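
The history-conditioned decoding idea in the abstract can be illustrated with a toy sketch. The Python below is not the paper's HESD. It assumes a hypothetical 2D priority grid standing in for the fused vision-language features, and it swaps the learned decoder for a hand-written inhibition-of-return rule, purely to show how the positions of past fixations can be fed back in when choosing the next one. The names `predict_scanpath`, `priority`, and `sigma` are illustrative assumptions, not identifiers from the paper.

```python
import math


def predict_scanpath(priority, num_fixations, start, sigma=1.5):
    """Toy autoregressive decoder: each step conditions on all past fixations.

    priority: 2D list of non-negative scores (assumed stand-in for fused features).
    start: (row, col) of the initial fixation.
    """
    history = [start]
    for _ in range(num_fixations - 1):
        best, best_score = None, -math.inf
        for r, row in enumerate(priority):
            for c, score in enumerate(row):
                # Stand-in for a learned decoder: damp the score near every
                # past fixation, so the history shapes the next prediction.
                for hr, hc in history:
                    d2 = (r - hr) ** 2 + (c - hc) ** 2
                    score *= 1.0 - math.exp(-d2 / (2 * sigma ** 2))
                if score > best_score:
                    best, best_score = (r, c), score
        history.append(best)
    return history


# Hypothetical 5x5 scene: a strong target at (4, 4) and a distractor at (0, 4).
grid = [[1.0] * 5 for _ in range(5)]
grid[4][4] = 10.0
grid[0][4] = 5.0
print(predict_scanpath(grid, 3, (0, 0)))  # [(0, 0), (4, 4), (0, 4)]
```

In this toy, the second fixation lands on the strongest cell, and the third moves to the distractor only because the history suppresses the location already visited. A learned decoder would replace the fixed suppression rule with parameters trained on human scanpaths.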
