RadEyeVideo: Enhancing general-domain Large Vision Language Model for chest X-ray analysis with video representations of eye gaze

Yunsoo Kim; Jinge Wu; Honghan Wu

arXiv:2507.09097·cs.CV·July 15, 2025

TL;DR

RadEyeVideo introduces a novel method to incorporate radiologists' eye-gaze video sequences into large vision-language models, significantly improving chest X-ray analysis and report generation by capturing temporal gaze dynamics.

Contribution

This work is the first to integrate eye-gaze videos into LVLMs for medical imaging, enhancing model performance in clinical tasks beyond existing heatmap or prompt-based methods.

Findings

01. Model performance improved by up to 24.6% in report generation.

02. Average improvement of 15.2% across tasks with eye-gaze videos.

03. RadEyeVideo enabled general LVLMs to outperform task-specific medical models.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated promising performance in chest X-ray (CXR) analysis. To enhance human-computer interaction, several studies have incorporated radiologists' eye gaze, typically through heatmaps or textual prompts. However, these methods often overlook the sequential order of eye movements, which could provide valuable insights by highlighting both the areas of interest and the order in which they are examined. In this work, we propose a novel approach called RadEyeVideo that integrates radiologists' eye-fixation data as a video sequence, capturing both the temporal and spatial dynamics of their gaze. We evaluate this method in CXR report generation and disease diagnosis using three general-domain, open-source LVLMs with video input capabilities. When prompted with eye-gaze videos, model performance improves by up to 24.6% in the report generation…
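To make the idea of a gaze "video" concrete, the following is a minimal sketch of how an ordered list of fixations could be rendered as a frame sequence over a CXR image. It is not the authors' implementation: the function name `fixations_to_frames`, the red-dot marker, the `radius` and `fps` parameters, and the rule of holding each fixation for a number of frames proportional to its duration are all assumptions made for illustration.

```python
import numpy as np


def fixations_to_frames(image, fixations, radius=3, fps=10):
    """Render ordered fixations as video frames (hypothetical helper).

    image:     2-D uint8 array of shape (H, W), a grayscale X-ray.
    fixations: list of (x, y, duration_seconds) in viewing order.
    Returns a uint8 array of shape (n_frames, H, W, 3).
    """
    h, w = image.shape
    base = np.stack([image] * 3, axis=-1)  # grayscale -> RGB
    yy, xx = np.ogrid[:h, :w]
    frames = []
    for x, y, duration in fixations:
        # Disc of the given radius centred on the fixation point.
        mask = (xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2
        frame = base.copy()
        frame[mask] = (255, 0, 0)  # assumed marker colour: red
        # Assumption: longer fixations occupy more frames (at least one).
        n_frames = max(1, int(round(duration * fps)))
        frames.extend([frame] * n_frames)
    if not frames:
        return np.empty((0, h, w, 3), dtype=np.uint8)
    return np.stack(frames)


if __name__ == "__main__":
    xray = np.zeros((8, 8), dtype=np.uint8)
    video = fixations_to_frames(xray, [(2, 2, 0.2), (5, 5, 0.1)], radius=1)
    print(video.shape)
```

Because the frames follow the fixation order, the sequence preserves both where the radiologist looked and when, which a single static heatmap discards.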

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. The proposed method, RadEyeVideo, combines video, text, and eye movement data to realize the fusion of multi-modal information and improve the accuracy and efficiency of diagnosis.
2. The authors conducted a thorough assessment of various eye-tracking integration techniques, providing strong empirical support for their claims.

Weaknesses

1. In Figure 2, the authors illustrate different prompting methods; however, the figure does not clearly distinguish between the use of text descriptions and video inputs. This ambiguity makes it hard to tell whether the authors used text descriptions to guide the prompt along with the video input or provided only video data.
2. The authors' experiments, conducted on only one dataset, are not sufficiently persuasive. This limitation may affect the generali

Reviewer 02 · Rating 5 · Confidence 5

Strengths

- RadEyeVideo's use of video-based eye-gaze data is a unique contribution that effectively captures the temporal and spatial dynamics of radiologists' focus. I like this idea.
- The study demonstrates substantial improvements, particularly in impression generation, highlighting RadEyeVideo's effectiveness in enhancing diagnostic tasks.
- The paper is clearly written. The authors use precise and concise language so that the reader can easily understand the methodology and results of the stu

Weaknesses

- Although this idea is interesting, it still relies on temporal and spatial information in the inference phase, which is difficult to apply to real clinical scenarios. Do the authors consider involving multiple information inputs only in the training phase and simulating zero-shot scenarios as much as possible in the inference phase?
- The study’s findings are limited by the small size of the MIMIC-Eye dataset, which may not fully capture the variability in real-world clinical settings, raising

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. This work introduces radiologists' eye-tracking data into LVLMs in video format, highlighting the temporal features of eye movement sequences.
2. The experiments in the paper are comprehensive, validating multiple diagnostic tasks across various datasets.

Weaknesses

1. The paper mentions using simple stacking of gaze points for heatmap generation, but it does not specify the radius size of the gaze points. Different gaze point sizes can affect the model's interpretation.
2. In-context learning often heavily relies on the provided examples, which can significantly influence the generated results. The paper does not discuss what constitutes suitable examples.
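The reviewer's point about gaze point radius can be illustrated with a small sketch. The paper's actual heatmap procedure is not given here, so the code below is only one plausible reading of "simple stacking": each gaze point adds a filled disc to an accumulator, and the result is scaled to [0, 1]. The function name `stack_gaze_heatmap` and the peak normalisation are assumptions.

```python
import numpy as np


def stack_gaze_heatmap(shape, points, radius):
    """Accumulate one filled disc per gaze point (hypothetical helper).

    shape:  (H, W) of the output heatmap.
    points: iterable of (x, y) gaze coordinates.
    radius: disc radius in pixels; the quantity the reviewer asks about.
    Returns a float array in [0, 1], or all zeros if there are no points.
    """
    h, w = shape
    yy, xx = np.ogrid[:h, :w]
    heat = np.zeros((h, w), dtype=np.float64)
    for x, y in points:
        # Each point contributes 1 to every pixel inside its disc.
        heat += (xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2
    peak = heat.max()
    return heat / peak if peak > 0 else heat


if __name__ == "__main__":
    gaze = [(2, 2), (3, 2)]
    for r in (0, 1):
        covered = np.count_nonzero(stack_gaze_heatmap((6, 6), gaze, r))
        print(f"radius={r}: {covered} pixels highlighted")
```

With the same two gaze points, a larger radius highlights more pixels and creates overlap between neighbouring discs, so the image the model receives differs noticeably depending on this unreported setting.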

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Image Retrieval and Classification Techniques · AI in cancer detection