Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models
Rong Quan, Yantao Lai, Dong Liang, Jie Qin

TL;DR
This paper introduces ScanVLA, a multimodal model that predicts human visual scanpaths for object search tasks by fusing vision-language features and incorporating historical fixation data.
Contribution
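ScanVLA combines three components to improve scanpath prediction in object referring tasks: vision-language feature fusion from a VLM, a History Enhanced Scanpath Decoder (HESD) that takes historical fixation positions as input, and a frozen Segmentation LoRA that helps localize the referred object.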
Findings
ScanVLA significantly outperforms existing methods in object referring scanpath prediction.
Incorporating historical fixation data improves prediction accuracy.
Using a frozen Segmentation LoRA enhances object localization without high computational costs.
Abstract
Object Referring-guided Scanpath Prediction (ORSP) aims to predict the human attention scanpath when a person searches for a specific target object in a visual scene according to a linguistic description of that object. Multimodal information fusion is key to ORSP. We therefore propose a novel model, ScanVLA, which first exploits a Vision-Language Model (VLM) to extract and fuse inherently aligned visual and linguistic feature representations from the input image and referring expression. Next, to enhance ScanVLA's perception of fine-grained positional information, we not only propose a novel History Enhanced Scanpath Decoder (HESD) that directly takes historical fixations' position information as input to help predict a more reasonable position for the current fixation, but also adopt a frozen Segmentation LoRA as an auxiliary component to help localize the referred object…
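The idea of feeding historical fixation positions back into the decoder can be illustrated with a toy autoregressive loop. This is only a sketch under stated assumptions, not the HESD described in the paper: the fused vision-language features are stood in for by a random 2D score grid, and a hand-written penalty on previously fixated locations replaces the learned decoder. The names `history_map`, `predict_scanpath`, `penalty`, and `sigma` are hypothetical and chosen for illustration.

```python
import math
import random


def history_map(fixations, height, width, sigma=1.0):
    """Sum of Gaussian bumps centred on each past fixation (row, col)."""
    return [
        [
            sum(
                math.exp(-((r - fr) ** 2 + (c - fc) ** 2) / (2 * sigma ** 2))
                for fr, fc in fixations
            )
            for c in range(width)
        ]
        for r in range(height)
    ]


def predict_scanpath(fused, start, steps, penalty=2.0):
    """Greedy decoding: every step is conditioned on all earlier fixations.

    `fused` is a toy stand-in for fused vision-language scores (assumption).
    The scoring rule below is hand-written, not a learned decoder.
    """
    height, width = len(fused), len(fused[0])
    fixations = [start]
    for _ in range(steps):
        hist = history_map(fixations, height, width)
        # Pick the grid cell with the best score after discounting
        # locations close to fixations that have already been made.
        best = max(
            ((r, c) for r in range(height) for c in range(width)),
            key=lambda rc: fused[rc[0]][rc[1]] - penalty * hist[rc[0]][rc[1]],
        )
        fixations.append(best)
    return fixations


# Toy usage: an 8x8 grid of random scores and a scanpath of 1 + 4 fixations.
random.seed(0)
fused = [[random.random() for _ in range(8)] for _ in range(8)]
path = predict_scanpath(fused, start=(4, 4), steps=4)
print(path)
```

The point of the sketch is the data flow: the list of past fixation positions is an explicit input at every step, so the prediction for the current fixation can depend on where the gaze has already been. In a learned model, the penalty term would be replaced by whatever the decoder learns from human scanpath data.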
