TL;DR
InfiMed introduces a novel training approach combining high-quality textual reasoning data, synthetic reflective CoT, and RLVR to enhance multimodal medical language models, achieving state-of-the-art results on medical benchmarks.
Contribution
The paper presents InfiMed, a new framework that improves medical multimodal LLMs by integrating diverse data and reasoning techniques, surpassing existing models in accuracy and reasoning ability.
Findings
InfiMed-RL-3B outperforms larger models such as InternVL3-8B on medical benchmarks.
Incorporating synthetic reflective CoT enhances reasoning capabilities.
Training with 188K SFT samples and 36K RLVR samples yields superior performance.
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in domains such as visual understanding and mathematical reasoning. However, their application in the medical domain is constrained by two key challenges: (1) multimodal medical datasets are scarce and often contain sparse information, limiting reasoning depth; and (2) Reinforcement Learning with Verifiable Rewards (RLVR), though effective in general domains, cannot reliably improve model performance in the medical domain. To overcome these challenges, during the supervised fine-tuning (SFT) stage, we incorporate high-quality textual reasoning data and general multimodal data alongside multimodal medical data to efficiently enhance foundational medical capabilities and restore the base model's reasoning ability. Moreover, considering that there are some multimodal medical datasets with sparse information, we…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. This paper is well-written, with a clear and easy-to-follow logical flow. 2. Extensive experiments on multiple benchmarks demonstrate the competitive performance of the proposed InfiMed-Series models. 3. The ablation studies showcase the effectiveness of each component.
1. Both the SFT and RLVR stages adopt mixed data (such as general multimodal data, textual medical data, and reflective-pattern-injected CoT data) for optimization. What are the proportions of these different data types, and how were they determined? Would varying these ratios lead to different effects? Moreover, why not adopt a progressive approach, feeding different types of data sequentially for optimization? 2. When constructing the reflective-pattern-injected CoT data, how is rejection samp
- The paper is well-structured and clearly presented. - The proposed pipeline for building a robust multimodal medical LLM is effective. - The extensive evaluation results convincingly demonstrate the method’s effectiveness.
- The overall contribution is somewhat incremental, building on DeepSeek and prior multimodal LLM work. Nonetheless, the execution is solid. - The use of general data in the SFT stage is not novel, as this approach has already been widely applied in LLM training (e.g., BianCangLLM [1]). - The experimental results may not objectively reflect the superiority of this method, since the open-source models and domain-specific LLMs were not fine-tuned on the same datasets as InfiMed. - Since InfiMed us
1. The paper is well written with illustrative figures and good explanations. 2. The results of the model are strong and can often surpass larger-scale models. 3. The analysis part of this paper is useful for future practitioners. It reveals the relationship between the training data and model performance on different benchmarks. Some conclusions are interesting, e.g. reasoning is not always helpful. It would be better if the experiments were run at a larger scale or on more recent models.
1. The methods in this paper are well known. The reasoning construction, the training pipeline, i.e. SFT followed by RLVR, and the reward functions are used by many concurrent papers, and it is hard to distinguish this paper. 2. The conclusions are drawn from a 3B model. As revealed by previous work (e.g. DeepSeek-R1 series), the reasoning capability is best incentivized only on large models. Some conclusions of this paper demonstrate that reasoning might hurt the model performance on a fe
(1) The method is well designed. The paper integrates reflective Chain-of-Thought (CoT) injection with the RLVR framework to enhance the model’s self-evaluation and multi-step reasoning capabilities. (2) The experiments cover multiple medical benchmark datasets, and the proposed method is validated through ablation studies and case analyses, confirming its effectiveness.
1. This study aims to enhance the model’s reasoning and understanding abilities. However, ablation results show that general multimodal data improve performance more than the reflective CoT, suggesting a potential inconsistency with the study’s focus. 2. The methodology and figures are unclear. It is recommended that the authors describe the inference trajectory of the general CoT to help readers understand the model’s reasoning mechanism. Additionally, they should further clarify why multimodal
