TL;DR
CoGaze introduces a novel pretraining framework for chest X-ray analysis that incorporates radiologists' gaze and clinical context to improve diagnostic reasoning and cross-modal alignment.
Contribution
It presents a context- and gaze-guided vision-language pretraining method that models radiologists' diagnostic workflow and enhances performance across multiple medical imaging tasks.
Findings
Outperforms state-of-the-art methods in report generation, classification, and retrieval.
Achieves improvements of up to +2.0% CheXbert F1 and +23.2% AUROC.
Uses radiologists' gaze and clinical context to better model disease-specific patterns and strengthen cross-modal alignment.
Abstract
Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, while radiologists' gaze -- a crucial cue for visual reasoning -- remains largely underexplored by existing methods. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal alignment. To bridge this gap, we introduce CoGaze, a Context- and Gaze-guided vision-language pretraining framework for chest X-rays. We first propose a context-infused vision encoder that models how radiologists integrate clinical context -- including patient history, symptoms, and diagnostic intent -- to guide diagnostic reasoning. We then present a multi-level supervision paradigm that (1) enforces intra- and inter-modal semantic alignment through hybrid-positive contrastive learning, (2)…
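The excerpt above names hybrid-positive contrastive learning but does not give its formulation. As a rough illustration only, the sketch below implements a generic multi-positive InfoNCE-style loss, in which each anchor may have several positives rather than only its paired sample. This is an assumption about the general family of technique, not the authors' loss. The function names, the toy similarity matrix, the choice of positives, and the temperature of 0.07 are all hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def multi_positive_contrastive_loss(sim, positives, temperature=0.07):
    """Generic multi-positive InfoNCE-style loss (hypothetical helper).

    sim[i][j]    : similarity between anchor i and candidate j
    positives[i] : set of candidate indices treated as positives for anchor i
    The per-anchor loss is the mean negative log-probability of its positives.
    """
    total = 0.0
    for row, pos in zip(sim, positives):
        if not pos:
            raise ValueError("each anchor needs at least one positive")
        log_probs = log_softmax([s / temperature for s in row])
        total += -sum(log_probs[j] for j in pos) / len(pos)
    return total / len(sim)


# Toy image-to-report similarities (made-up values, not from the paper).
sim = [
    [0.9, 0.8, 0.1],
    [0.7, 0.9, 0.0],
    [0.1, 0.0, 0.9],
]

# Assumption: samples 0 and 1 are semantically matched, so each counts as a
# positive for the other in addition to its own paired report.
hybrid_positives = [{0, 1}, {0, 1}, {2}]
paired_only = [{0}, {1}, {2}]

loss_hybrid = multi_positive_contrastive_loss(sim, hybrid_positives)
loss_paired = multi_positive_contrastive_loss(sim, paired_only)
print(f"hybrid positives: {loss_hybrid:.4f}, paired only: {loss_paired:.4f}")
```

With `paired_only`, the function reduces to the standard one-positive InfoNCE objective. Adding extra positives changes only which log-probabilities are averaged for each anchor.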
