GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays

David Wong; Zeynep Isik; Bin Wang; Marouane Tliba; Gorkem Durak; Elif Keles; Halil Ertugrul Aktas; Aladine Chetouani; Cagdas Topel; Nicolo Gennaro; Camila Lopes Vendrami; Tugce Agirlar Trabzonlu; Amir Ali Rahsepar; Laetitia Perronne; Matthew Antalek; Onural Ozturk; Gokcan Okur; Andrew C. Gordon; Ayis Pyrros; Frank H. Miller; Amir Borhani; Hatice Savas; Eric Hart; Elizabeth Krupinski; and Ulas Bagci

arXiv:2604.11653·cs.CV·April 14, 2026


TL;DR

GazeVaLM is a public eye-tracking dataset and benchmark for studying how expert radiologists and multimodal LLMs diagnose chest X-rays and judge whether they are real or AI-generated, enabling direct human-AI comparison.

Contribution

The paper introduces GazeVaLM, a novel dataset and protocol for studying gaze behavior, diagnostic accuracy, and AI-human comparison in chest radiograph authenticity assessment.

Findings

1. Radiologists show high gaze agreement and inter-observer consistency.

2. Benchmarking reveals differences between radiologists and LLMs in diagnostic accuracy.

3. The dataset enables analysis of gaze patterns, decision-making, and AI performance in medical imaging.

Abstract

We introduce GazeVaLM, a public eye-tracking dataset for studying clinical perception during chest radiograph authenticity assessment. The dataset comprises 960 gaze recordings from 16 expert radiologists interpreting 30 real and 30 synthetic chest X-rays (generated by diffusion-based generative AI) under two conditions: diagnostic assessment and real-fake classification (a Visual Turing test). For each image-observer pair, we provide raw gaze samples, fixation maps, scanpaths, saliency density maps, structured diagnostic labels, and authenticity judgments. We extend the protocol to 6 state-of-the-art multimodal LLMs, releasing their predicted diagnoses, authenticity labels, and confidence scores under matched conditions, enabling direct human-AI comparison at both decision and uncertainty levels. We further provide analyses of gaze agreement, inter-observer consistency, and benchmarking…
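
As a rough illustration of what a saliency density map and a gaze-agreement score can look like, the sketch below blurs duration-weighted fixations with a Gaussian and compares two observers' maps with the Pearson correlation coefficient (often called CC in saliency work). This is not the authors' pipeline: the function names, the `(x, y, duration)` fixation format, the 64x64 grid, and the `sigma` value are all assumptions made for the example, and the abstract does not state which agreement metric the paper uses.

```python
import numpy as np


def gaussian_kernel_1d(sigma: float) -> np.ndarray:
    """Normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def saliency_map(fixations, height: int, width: int, sigma: float = 5.0) -> np.ndarray:
    """Build a density map from (x, y, duration) fixations in pixel coordinates.

    Hypothetical input format; assumes the image is larger than the kernel.
    """
    density = np.zeros((height, width), dtype=float)
    for x, y, duration in fixations:
        col, row = int(round(x)), int(round(y))
        if 0 <= row < height and 0 <= col < width:
            density[row, col] += duration  # weight each fixation by its duration

    # Separable Gaussian blur: filter along rows, then along columns.
    kernel = gaussian_kernel_1d(sigma)
    density = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, density)
    density = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, density)

    total = density.sum()
    return density / total if total > 0 else density


def correlation_coefficient(map_a: np.ndarray, map_b: np.ndarray) -> float:
    """Pearson correlation between two saliency maps of the same shape."""
    a = map_a.ravel() - map_a.mean()
    b = map_b.ravel() - map_b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


if __name__ == "__main__":
    # Made-up fixations for three hypothetical observers on a 64x64 image.
    observer_a = [(20, 20, 0.3), (40, 30, 0.5)]
    observer_b = [(21, 19, 0.4), (41, 31, 0.4)]  # looks at nearly the same regions
    observer_c = [(60, 60, 1.0)]                 # looks somewhere else entirely

    map_a = saliency_map(observer_a, 64, 64)
    map_b = saliency_map(observer_b, 64, 64)
    map_c = saliency_map(observer_c, 64, 64)

    print("A vs B:", correlation_coefficient(map_a, map_b))
    print("A vs C:", correlation_coefficient(map_a, map_c))
```

Averaging such pairwise scores over all observer pairs for one image gives a simple per-image agreement summary; other metrics (for example, similarity or KL divergence between maps) could be substituted in the same structure.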


Code & Models

Repositories

https://huggingface.co/datasets/davidcwong/GazeVaLM
