CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning

Wenjie Li; Yujie Zhang; Haoran Sun; Yueqi Li; Fanrui Zhang; Mengzhe Xu; Victoria Borja Clausich; Sade Mellin; Renhao Yang; Chenrun Wang; Jethro Zih-Shuo Wang; Shiyi Yao; Gen Li; Yidong Xu; Hanyu Wang; Yilin Huang; Angela Lin Wang; Chen Shi; Yin Zhang; Jianan Guo; Luqi Yang; Renxuan Li; Yang Xu; Jiawei Liu; Yao Zhang; Lei Liu; Carlos Gutiérrez SanRomán; Lei Wang

arXiv:2508.03733 · cs.LG · August 7, 2025

TL;DR

CX-Mind is a novel multimodal large language model that employs curriculum-guided reinforcement learning to perform interleaved reasoning in chest X-ray diagnosis, significantly improving interpretability and accuracy over existing models.

Contribution

The paper introduces what the authors describe as the first generative model for interleaved "think-answer" reasoning in CXR tasks, trained with curriculum-based reinforcement learning and verifiable process rewards (CuRL-VPR).

Findings

01

Outperforms existing models in visual understanding, text generation, and spatiotemporal alignment.

02

Achieves a 25.1% performance improvement over comparable CXR-specific models.

03

Surpasses previous methods on clinical datasets with higher recall and expert approval.

Abstract

Chest X-ray (CXR) imaging is one of the most widely used diagnostic modalities in clinical practice, encompassing a broad spectrum of diagnostic tasks. Recent advancements have seen the extensive application of reasoning-based multimodal large language models (MLLMs) in medical imaging to enhance diagnostic efficiency and interpretability. However, existing multimodal models predominantly rely on "one-time" diagnostic approaches, lacking verifiable supervision of the reasoning process. This leads to challenges in multi-task CXR diagnosis, including lengthy reasoning, sparse rewards, and frequent hallucinations. To address these issues, we propose CX-Mind, the first generative model to achieve interleaved "think-answer" reasoning for CXR tasks, driven by curriculum-based reinforcement learning and verifiable process rewards (CuRL-VPR). Specifically, we constructed an instruction-tuning…
