What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric

Mohamed Amine Kerkouri; Marouane Tliba; Bin Wang; Aladine Chetouani; Ulas Bagci; Alessandro Bruno

arXiv:2604.08494 · cs.CV · April 10, 2026

TL;DR

This paper introduces a semantic scanpath similarity framework that uses vision-language models and NLP metrics to compare eye-tracking scanpaths, capturing content-based similarity beyond spatial alignment.

Contribution

It integrates VLMs and NLP metrics into scanpath analysis, enabling content-aware, interpretable similarity measures that complement traditional spatial methods.

Findings

01

Semantic similarity captures variance that is partially independent of geometric alignment.

02

Content-based measures reveal cases of high content agreement despite spatial divergence.

03

Contextual encoding impacts description fidelity and metric stability.

Abstract

Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement…
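
The contrast the abstract draws between geometric and semantic comparison can be illustrated with a minimal sketch. It is not the authors' implementation: the VLM step is skipped entirely, the per-fixation descriptions are hand-written placeholders, and token-level Jaccard overlap stands in for whichever lexical NLP metrics the paper actually uses. The function names `dtw_distance` and `jaccard_similarity` and all the toy data are assumptions made for illustration.

```python
import math
import re


def dtw_distance(path_a, path_b):
    """Classic dynamic time warping cost between two fixation sequences.

    Each path is a list of (x, y) pixel coordinates; the local cost is the
    Euclidean distance between fixations.
    """
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(path_a[i - 1], path_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def tokens(text):
    """Lowercased set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard_similarity(descriptions_a, descriptions_b):
    """Token-set overlap between two scanpath-level descriptions.

    Assumption: per-fixation descriptions are aggregated by simple
    concatenation; the paper's aggregation may differ.
    """
    a = tokens(" ".join(descriptions_a))
    b = tokens(" ".join(descriptions_b))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical toy data: two viewers fixate the same kinds of objects
# in different parts of the image.
fixations_a = [(100, 200), (150, 220)]
fixations_b = [(700, 500), (760, 520)]
descriptions_a = ["brown dog", "red ball"]  # placeholder VLM outputs
descriptions_b = ["brown dog", "red ball"]  # placeholder VLM outputs

spatial_cost = dtw_distance(fixations_a, fixations_b)
semantic_sim = jaccard_similarity(descriptions_a, descriptions_b)
print(f"DTW cost: {spatial_cost:.1f} px, lexical similarity: {semantic_sim:.2f}")
```

In this toy case the two scanpaths are far apart spatially (a large DTW cost) yet lexically identical, the kind of high content agreement despite spatial divergence that the abstract describes. An embedding-based metric would replace `jaccard_similarity` with a cosine similarity over sentence embeddings.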
