Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios
Zhongzhen Huang, Linjie Mu, Yakun Zhu, Xiangyu Zhao, Shaoting Zhang, Xiaofan Zhang

TL;DR
This paper introduces MedE$^2$, a two-stage training pipeline that significantly improves multimodal reasoning in medical AI models by eliciting and enhancing reasoning capabilities through curated data and demonstrations.
Contribution
The paper presents a novel two-stage post-training method, MedE$^2$, specifically designed to improve multimodal reasoning in medical AI models, addressing a gap in current research.
Findings
Models trained with MedE$^2$ outperform baselines on multiple benchmarks.
The approach improves reasoning accuracy and reliability in medical multimodal tasks.
Validation confirms robustness across larger models and inference settings.
Abstract
Effective clinical decision-making depends on iterative, multimodal reasoning across diverse sources of evidence. The recent emergence of multimodal reasoning models has significantly transformed the landscape of solving complex tasks. Although such models have achieved notable success in mathematics and science, their application to medical domains remains underexplored. In this work, we propose \textit{MedE$^2$}, a two-stage post-training pipeline that elicits and then enhances multimodal reasoning for medical domains. In Stage-I, we fine-tune models using 2,000 text-only data samples containing precisely orchestrated reasoning demonstrations to elicit reasoning behaviors. In Stage-II, we further enhance the model's reasoning capabilities using 1,500 rigorously curated multimodal medical cases, aligning model reasoning outputs with our proposed multimodal medical reasoning preference.…
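The reviews on this page describe Stage-II as Direct Preference Optimization (DPO) over preferred and rejected reasoning outputs. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair in plain Python. It is not the authors' implementation: the function names, the example log-probabilities, and the value beta=0.1 are assumptions made for this example, and the paper's own equation (Section 3.3, per the reviews) is not reproduced here.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed token log-probability of a full response
    under either the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)


if __name__ == "__main__":
    # Hypothetical log-probabilities for a preferred and a rejected answer.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy equals the reference model the loss is log 2, and it falls below that as the policy assigns relatively more probability to the preferred response than the reference does.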
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper presents comprehensive and rigorous experiments across a diverse suite of medical benchmarks, including MedQA, Medbullets, MedXpertQA-MM, MMMU-Health, and MMMU-Pro-Health. The pipeline is validated on multiple open-source base models (QwenVL2.5 and InternVL3.0 across several parameter scales) and compared against both open-source and proprietary state-of-the-art models (Table 1, Table 2), highlighting its robustness and scalability.
1. While the DPO loss is clearly presented (Equation in Section 3.3), the justification for why the particular Multimodal Medical Reasoning Preference (MMRP) criteria yield good calibration/hallucination mitigation is primarily empirical rather than theoretically grounded. A deeper theoretical or empirical comparison between different preference criteria, the effect of each criterion, or how MMRP compares to alternative alignment objectives (e.g., from MedMMV or LoRA-MedSim) would reinforce the t
1. Clear, effective training recipe with small, high-quality data: Stage-I text-only elicitation reliably boosts complex clinical reasoning (including multimodal benchmarks) and scales better with larger base models; Stage-II DPO with MMRP further reduces hallucinations and promotes image-grounded, reflective reasoning. 2. Strong empirical validation and practicality: Consistent gains over competitive baselines and larger open-source models, robustness to inference-time scaling, careful data cu
Necessity of Stage-I and missing baselines: 1. Many medical LLMs (e.g., Med-PaLM, BioMedGPT, LLaVA-Med) already show basic chain-of-thought (CoT) capabilities under prompting. Why is a separate text-only reasoning SFT (Stage-I) still necessary? If Stage-I is skipped and you go straight to multimodal training or preference alignment, what concrete negative effects do you anticipate (e.g., unstable reasoning chains, higher cross-modal hallucination, reduced utilization of image evidence, degra
1. The paper presents a clear and well-motivated rationale for developing multimodal clinical LLMs. 2. It introduces a carefully curated multimodal medical dataset. 3. The method leverages reinforcement learning through Direct Preference Optimization (DPO) to mitigate hallucinations in reasoning. 4. The proposed approach achieves superior quantitative performance compared to existing multimodal LLMs.
1. Limited contributions and conceptual overlap. The paper shows substantial conceptual overlap with prior work: ClinRagen [1]. The two-stage reasoning distillation framework, involving textual reasoning elicitation followed by multimodal enhancement, closely resembles existing work on clinical reasoning generation (e.g., ClinRagen). The contribution thus appears incremental or derivative, and the ClinRagen paper is neither cited nor properly acknowledged. 2. Lack of novel
1. Clear motivation: The performance drop of LVLMs in medical reasoning tasks is convincingly demonstrated by empirical evidence. 2. Well-designed training framework: The two-stage strategy effectively avoids the difficulty of reward model construction in the medical domain. 3. High-quality data construction with professional validation significantly enhances training reliability. 4. Strong performance improvements, including outperforming much larger proprietary models, demonstrating good scala
1. Limited conceptual novelty: The method closely aligns with existing CoT SFT + DPO pipelines (e.g., DeepSeek-R1), with improvements mainly on the medical data side. 2. Unclear contribution of the visual modality: It remains uncertain whether the model truly leverages image features or merely relies on textual priors. 3. Small-scale medical data (5k) with unclear diversity and potential hidden curation bias. 4. Reliance on closed-source proprietary models for filtering and preference judging af
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Machine Learning in Healthcare
