Eyes on the Image: Gaze Supervised Multimodal Learning for Chest X-ray Diagnosis and Report Generation

Tanjim Islam Riju; Shuchismita Anwar; Saman Sarker Joy; Farig Sadeque; Swakkhar Shatabda

arXiv:2508.13068·cs.CV·August 19, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a two-stage multimodal framework that uses gaze data to improve disease classification and report generation from chest X-rays, enhancing interpretability and diagnostic accuracy.

Contribution

It presents a novel gaze-guided contrastive learning architecture and a modular report generation pipeline leveraging eye-tracking data for better medical image analysis.

Findings

01 Gaze-informed models outperform the baseline on classification metrics.

02 Incorporating eye-tracking data improves report relevance and accuracy.

03 The approach enhances the interpretability of AI in radiology.

Abstract

We propose a two-stage multimodal framework that enhances disease classification and region-aware radiology report generation from chest X-rays, leveraging the MIMIC-Eye dataset. In the first stage, we introduce a gaze-guided contrastive learning architecture for disease classification. It integrates visual features, clinical labels, bounding boxes, and radiologist eye-tracking signals and is equipped with a novel multi-term gaze-attention loss combining MSE, KL divergence, correlation, and center-of-mass alignment. Incorporating fixations improves F1 score from 0.597 to 0.631 (+5.70%) and AUC from 0.821 to 0.849 (+3.41%), while also improving precision and recall, highlighting the effectiveness of gaze-informed attention supervision. In the second stage, we present a modular report generation pipeline that extracts confidence-weighted diagnostic keywords, maps them to anatomical…
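The multi-term gaze-attention loss named in the abstract (MSE, KL divergence, correlation, and center-of-mass alignment) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the normalization of both maps to probability distributions, the direction of the KL term, the use of `1 - Pearson r` as the correlation penalty, and the equal default weights are all assumptions made for illustration.

```python
import numpy as np


def _to_distribution(m, eps=1e-8):
    """Clip a 2D map to non-negative values and normalize it to sum to 1."""
    m = np.clip(np.asarray(m, dtype=np.float64), 0.0, None) + eps
    return m / m.sum()


def _center_of_mass(p):
    """Expected (row, col) position of a normalized 2D map, scaled to [0, 1]."""
    h, w = p.shape
    rows = np.arange(h) / max(h - 1, 1)
    cols = np.arange(w) / max(w - 1, 1)
    return np.array([(p.sum(axis=1) * rows).sum(), (p.sum(axis=0) * cols).sum()])


def gaze_attention_loss(attn, gaze, weights=(1.0, 1.0, 1.0, 1.0)):
    """Hypothetical composite loss between a model attention map and a gaze heatmap.

    Returns the weighted total and a dict of the four unweighted terms.
    The weights are placeholders, not values reported in the paper.
    """
    p = _to_distribution(attn)   # model attention as a distribution
    q = _to_distribution(gaze)   # radiologist gaze as a distribution

    mse = np.mean((p - q) ** 2)

    # Assumed direction: KL(gaze || attention).
    kl = np.sum(q * np.log(q / p))

    # Pearson correlation over pixels, turned into a penalty in [0, 2].
    pc, qc = p - p.mean(), q - q.mean()
    corr = (pc * qc).sum() / (np.sqrt((pc ** 2).sum() * (qc ** 2).sum()) + 1e-12)
    corr_term = 1.0 - corr

    # Squared distance between the two centers of mass.
    com = np.sum((_center_of_mass(p) - _center_of_mass(q)) ** 2)

    parts = {"mse": mse, "kl": kl, "corr": corr_term, "com": com}
    w_mse, w_kl, w_corr, w_com = weights
    total = w_mse * mse + w_kl * kl + w_corr * corr_term + w_com * com
    return total, parts
```

Under these assumptions the loss is close to zero when the attention map matches the gaze heatmap and grows as the two maps diverge in shape or location. In a training setup the same terms would be written in an autodiff framework so that gradients flow into the attention weights.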

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The main strengths of the paper are that it presents a novel multimodal framework that combines eye-tracking data, radiology text, anatomical masks, and chest X-ray images. The two-stage design, which combines region-grounded text generation and gaze-guided classification, is both conceptually and clinically sound. The methodological rigor of the proposed method is further reinforced by thorough ablations and meticulously documented dataset curation.

Weaknesses

Major:
1. The main flaw is that I fail to see how this approach advances clinical practice. Annotations are even more challenging than finding medical data to train AI models. It might not be feasible to design an eye gaze experiment to regulate the annotations in the suggested method.
2. One of the major concerns is the experimental validation of the proposed method. A few points that are lacking: a) an experiment that uses shuffled/mismatched transcripts or eliminates text from Stage 1 (and f

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The use of radiologist gaze as explicit spatial supervision is innovative and well-motivated. The trust-calibrated composite loss (MSE, KL, Pearson, CoM) is a thoughtful approach to aligning model attention with human expertise.
2. The detailed alignment procedure for MIMIC-Eye demonstrates careful handling of heterogeneous annotations.
3. It provides clear evidence of incremental benefits from each modality, and the extended ablations examine backbone choices systematically.
4. Comparing si

Weaknesses

1. The entire evaluation is on a single dataset (MIMIC-Eye, n=2,877 aligned samples). It lacks comparison with the state of the art on standard benchmarks.
2. When comparing report generation performance, the authors do not compare their method with current report generation models.
3. This method introduces many modules for intermediate fixation prediction, but the authors do not compare its computational efficiency with that of current report generation methods.
4. I think this method is somehow

Reviewer 03 · Rating 4 · Confidence 3

Strengths

Important problem
- Producing radiology reports that ground findings in correct spatial regions in the image is a high impact problem.
Multimodal approach
- This paper proposes a method to integrate many streams of data with complementary information (the image, the text report, gaze data, and bounding box data) to improve generation performance, which is an interesting approach.
- The particular approach of using gaze data does seem to improve performance compared to not using gaze data and

Weaknesses

Doubts about the report generation step
- This approach to report generation seems to be closer to pattern memorization and pattern matching than to true spatial grounding. The authors rely on condition-specific vocabularies (built offline by processing the MIMIC database with an LLM) to produce phrases that “sound like” radiologists, and they “ground” those findings using 17 general thoracic regions identified via normalized bounding boxes. While I believe this approach would appear to work well

Reviewer 04 · Rating 2 · Confidence 4

Strengths

Highly practical: Gaze alignment is introduced during training, but only images are used during inference. The paper uses a "learnable gaze token + composite alignment loss (MSE/KL/correlation/centroid) + curriculum-based weighting" to constrain attention to human gaze during training. The final test does not rely on any gaze or text input, while still improving discrimination metrics (AUC/F1) and obtaining more interpretable attention maps. This "gaze-for-training, image-only-inference" design

Weaknesses

1. Insufficient distinction between novelty and prior work. The paper introduces gaze supervision into multimodal chest radiograph classification and report generation, using a core approach of gaze tokens + composite alignment loss + regionalized report generation. However, compared to existing work using gaze for medical representation alignment/attention guidance (e.g., using gaze as weak supervision, as channel/mask input, or as a constraint in multimodal contrastive learning), the methodolo

Taxonomy

Topics: Radiology practices and education · COVID-19 diagnosis using AI