A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation

Yabin Zhang; Chong Wang; Yunhe Gao; Jiaming Liu; Maya Varma; Justin Xu; Sophie Ostmeier; Jin Long; Sergios Gatidis; Seena Dehkharghani; Arne Michalson; Eun Kyoung Hong; Christian Bluethgen; Haiwei Henry Guo; Alexander Victor Ortiz; Stephan Altmayer; Sandhya Bodapati; Joseph David Janizek; Ken Chang; Jean-Benoit Delbrouck; Akshay S. Chaudhari; Curtis P. Langlotz

arXiv:2604.00493 · cs.CV · April 2, 2026

TL;DR

CheXOne is a reasoning-enabled vision-language model for chest X-ray interpretation that generates diagnostic predictions along with explicit, clinically grounded reasoning traces, improving interpretability and performance.

Contribution

The paper introduces CheXOne, a novel model that combines instruction tuning and reinforcement learning to produce explicit reasoning traces in CXR interpretation, enhancing transparency and accuracy.
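To make the idea of a prediction paired with an explicit reasoning trace concrete, the following is a minimal sketch of how such an output could be represented. It is not CheXOne's actual output format: the class names `ReasoningStep` and `Interpretation`, their fields, and the example values are all hypothetical, chosen only to mirror the chain from visual evidence to radiographic finding to prediction described here.

```python
from dataclasses import dataclass, field


@dataclass
class ReasoningStep:
    """One link in a hypothetical trace: what was seen and what it implies."""
    visual_evidence: str
    finding: str


@dataclass
class Interpretation:
    """Hypothetical container pairing a prediction with its reasoning trace."""
    prediction: str
    steps: list[ReasoningStep] = field(default_factory=list)

    def render(self) -> str:
        # Number each evidence-to-finding step, then state the prediction last.
        lines = [
            f"{i}. {s.visual_evidence} -> {s.finding}"
            for i, s in enumerate(self.steps, 1)
        ]
        lines.append(f"Prediction: {self.prediction}")
        return "\n".join(lines)


# Illustrative values only; not taken from the paper or the model.
example = Interpretation(
    prediction="cardiomegaly",
    steps=[ReasoningStep("enlarged cardiac silhouette", "increased cardiothoracic ratio")],
)
print(example.render())
```

Keeping the trace as structured steps rather than free text is one possible design choice; it would let each step be checked separately for factuality, though the paper may evaluate traces differently.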

Findings

01  CheXOne outperforms existing models on multiple benchmarks.
02  In a clinical study, CheXOne-generated reports were comparable to or better than resident-written reports in 55% of cases.
03  The generated reasoning traces demonstrate high clinical factuality and causal support.

Abstract

Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with…

Code & Models

Models

StanfordAIMI/CheXOne (model, 2.2k downloads, 13 likes)
