TL;DR
MedTVT-R1 is a novel multimodal large language model designed to improve multi-disease diagnosis by integrating heterogeneous medical data, employing a new dataset and reinforcement learning techniques for better reasoning and interpretability.
Contribution
We introduce MedTVT-R1, a multimodal LLM framework with a new dataset and reinforcement learning methods, advancing multi-disease diagnosis and interpretability in medical AI.
Findings
Outperforms existing models in multimodal feature utilization
Achieves superior accuracy in multi-disease diagnosis
Enhances diagnostic reasoning with reinforcement fine-tuning
Abstract
Accurate and interpretable multi-disease diagnosis remains a critical challenge in medical research, particularly when leveraging heterogeneous multimodal medical data. Current approaches often rely on single-modal data, limiting their ability to comprehensively understand complex diseases. To address this, we propose MedTVT-R1, a novel Multimodal Large Language Model (MLLM) framework designed to integrate clinical multimodal data for reasoning and diagnosing multiple diseases. We construct MedTVT-QA, a curated instruction dataset that provides question-answer pairs for physiological-level interpretations and disease-level diagnoses with a Chain of Evidence approach. MedTVT-R1 incorporates a modality perception layer to capture inter-modal dependencies and adaptively weight modality contributions. Additionally, we employ Group Relative Policy Optimization (GRPO)-based Reinforcement…
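The abstract says the modality perception layer adaptively weights modality contributions but does not say how. As a rough illustration only, the sketch below uses one common pattern: per-modality relevance scores are turned into softmax weights and used to average the embeddings. The function names, the scalar scores, and the fixed-size embeddings are all assumptions made for this example, not the authors' design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(embeddings, relevance_scores):
    """Weighted average of per-modality embeddings (hypothetical helper).

    embeddings: dict mapping modality name -> list of floats (equal lengths).
    relevance_scores: dict mapping modality name -> float; in a real model
    these would be learned, here they are plain inputs (assumption).
    Returns the fused vector and the weight assigned to each modality.
    """
    names = sorted(embeddings)
    weights = softmax([relevance_scores[name] for name in names])
    dim = len(embeddings[names[0]])
    fused = [0.0] * dim
    for name, weight in zip(names, weights):
        for i, value in enumerate(embeddings[name]):
            fused[i] += weight * value
    return fused, dict(zip(names, weights))


# Toy example with the three modalities named in the reviews (ECG, CXR, labs).
toy_embeddings = {"ecg": [1.0, 0.0], "cxr": [0.0, 1.0], "lab": [1.0, 1.0]}
toy_scores = {"ecg": 2.0, "cxr": 0.5, "lab": 1.0}
fused_vector, modality_weights = fuse_modalities(toy_embeddings, toy_scores)
```

Because the weights sum to one, a modality with a much higher score can dominate the fused vector, which is the kind of modality bias one reviewer raises about adaptive weighting.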
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The three-stage training strategy (PT, SFT, RFT) progressively builds model capabilities
- The Modality Perception Layer design with CMHA and CAO is technically justified for handling heterogeneous modalities
- The use of GRPO with Jaccard reward is appropriate for the multi-label disease prediction task
- The paper includes good comparisons with multiple baselines (8 general-purpose + 3 medical-specific MLLMs)
- The GRPO training uses only 500 iterations, which might be insufficient for convergence
- Validation on a single dataset (MIMIC-IV) makes generalization hard to assess
- Only a small model (1B) is used in the experiments
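To make the reviewer's point about a Jaccard reward concrete, here is a minimal sketch of how such a reward could score a predicted set of diagnoses against the reference set, followed by the group-relative normalization that GRPO applies to rewards within one sampled group. How the paper parses model outputs, whether it adds other reward terms, and whether it uses the population or sample standard deviation are not stated in this summary; the function names and those choices are assumptions.

```python
from statistics import mean, pstdev


def jaccard_reward(predicted, reference):
    """Jaccard index between predicted and reference label sets.

    Returns 1.0 when both sets are empty (a convention chosen for this
    sketch, not taken from the paper).
    """
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0
    return len(pred & ref) / len(pred | ref)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    Population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: three sampled answers for one patient whose reference labels
# are two of the diseases named in the reviews.
reference = {"Hypertension", "Coronary Artery Disease"}
sampled_predictions = [
    {"Hypertension", "Coronary Artery Disease"},  # exact match
    {"Hypertension"},                              # one label missed
    {"Acute Renal Failure"},                       # no overlap
]
rewards = [jaccard_reward(p, reference) for p in sampled_predictions]
advantages = group_relative_advantages(rewards)
```

Unlike an exact-match reward, the Jaccard index gives partial credit when only some labels are right, so a partly correct answer is still ranked above one with no overlap within its group.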
1. The proposed MedTVT-R1 framework effectively integrates ECG, CXR, and laboratory data, demonstrating a well-designed multimodal large language model (MLLM) architecture that addresses the inherent limitations of single-modality approaches.
2. The introduction of the MedTVT-QA dataset with a Chain of Evidence (CoE) structure represents a meaningful contribution, enabling reasoning over physiological processes and multi-disease diagnosis in a structured and interpretable manner.
3. The use of Rein
1. The methodological innovation of this paper appears limited, as the proposed framework seems to be a combination or extension of existing approaches rather than a fundamentally novel contribution.
2. The paper does not include comparative experiments with the reward function used in DeepSeek-R1. To more convincingly support the claimed advantages, it is recommended that the authors include corresponding comparative studies.
3. The adaptive weighting fusion introduces modality bias, causing the
1. An innovative modality interaction and integration framework for LLMs to understand multimodal EHRs.
2. A newly constructed multimodal dataset based on MIMIC for multimodal reasoning tasks.
3. A well-organized pipeline for dataset generation and model training.
1. In EHRs, especially those derived from the MIMIC datasets, clinical notes are a crucial modality that reflects patients’ health states. The authors should clarify why this modality was not included.
2. The paper lacks citations to prior multimodal medical reasoning studies, such as RAIM [1], ClinRaGen [2], etc.
3. The dataset only covers a small number of diseases. The authors should explain the rationale behind selecting only Coronary Artery Disease, Acute Renal Failure, Hypertension, Atri
The problem this paper addresses is clinically relevant and important, tackling significant research gaps. The MedTVT-QA dataset, which is claimed to be the first of its kind, is a valuable contribution that could serve as a foundation for future related work. The use of GRPO increases the novelty of this work. Overall, the writing is clear and easy to follow.
The experimental design has potential issues that warrant clarification, particularly concerning the choice of baselines and the unusual training/testing data split. The persuasiveness of the claimed effectiveness relies on addressing these aspects; please see details in the "Questions" section.
