Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation
Guangjing Yang, ZhangYuan Yu, Ziyuan Qin, Xinyuan Song, Huahui Yi, Qingbo Kang, Jun Gao, Yiyue Li, Chenlin Du, Qicheng Lao

TL;DR
This paper introduces VRFT-Aug, a reinforcement fine-tuning framework for medical imaging that augments both perception and reasoning and outperforms standard fine-tuning and RFT baselines.
Contribution
The work presents novel training strategies for reinforcement fine-tuning in medical imaging, integrating perception and reasoning augmentation to improve model reliability.
Findings
VRFT-Aug outperforms standard fine-tuning and RFT baselines on multiple datasets.
The proposed methods stabilize the RFT process in medical imaging.
Practical heuristics for training can be generalized to other medical image tasks.
Abstract
While recent advances in Reinforcement Fine-Tuning (RFT) have shown that rule-based reward schemes can enable effective post-training for large language models, their extension to cross-modal, vision-centric domains remains largely underexplored. This limitation is especially pronounced in the medical imaging domain, where effective performance requires both robust visual perception and structured reasoning. In this work, we address this gap by proposing VRFT-Aug, a visual reinforcement fine-tuning framework tailored for the medical domain. VRFT-Aug introduces a series of training strategies designed to augment both perception and reasoning, including prior knowledge injection, perception-driven policy refinement, medically informed reward shaping, and behavioral imitation. Together, these methods aim to stabilize and improve the RFT process. Through extensive experiments across…
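To make the idea of reward shaping concrete, the sketch below contrasts a sparse rule-based reward with a distance-shaped one. It is an illustration under stated assumptions, not the paper's reward: it assumes an ordinal grading task with integer grades 0 to num_grades - 1, and the function names and the linear decay are hypothetical choices made for this example.

```python
def binary_reward(pred: int, target: int) -> float:
    """Sparse rule-based reward: 1 only for an exact match."""
    return 1.0 if pred == target else 0.0


def distance_shaped_reward(pred: int, target: int, num_grades: int) -> float:
    """Hypothetical shaped reward for ordinal grades (an assumption, not the paper's formula).

    The reward decays linearly with the distance between the predicted and
    true grade, so a near miss still earns partial credit.
    """
    if num_grades < 2:
        raise ValueError("num_grades must be at least 2")
    # Predictions outside the valid grade range earn nothing.
    if not (0 <= pred < num_grades and 0 <= target < num_grades):
        return 0.0
    return 1.0 - abs(pred - target) / (num_grades - 1)


if __name__ == "__main__":
    target = 3
    for pred in range(5):
        print(pred, binary_reward(pred, target), distance_shaped_reward(pred, target, 5))
```

With the binary reward, every wrong grade scores 0, so a batch of near misses carries no learning signal. The shaped variant ranks an off-by-one prediction above an off-by-three one, which is the kind of denser signal reward shaping is meant to provide.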
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper tackles a timely and significant problem: adapting reinforcement fine-tuning (RFT) for large vision-language models to the medical domain.
1. Disjointed Framework Evaluation 2. Limited Novelty of Components
The paper makes a creative and well-motivated extension of RFT from LLMs to medical-language models. By bridging RFT and medical-language reasoning, this work could stand as a practical foundation for safe and interpretable medical AI systems. The empirical improvements are consistent and meaningful. Its originality lies not in inventing a new algorithmic family, but in articulating a new decomposition of the RFT pipeline into perception, policy, and reward components, each augmented with domain-
W1. Incremental algorithmic novelty. The four augmentations (prompt, policy, recitation, fuzzy reward) are conceptually coherent but individually modest extensions of known techniques: prompt engineering, auxiliary localization, imitation control, and fuzzy reward shaping. The work’s strength is integration rather than theoretical innovation. Given that the work targets medical applications, I can understand this combination of existing techniques, though. W2. Scalability and generalization not demons
This paper addresses the limitations of the GRPO method in enhancing the reasoning capabilities of medical multimodal large vision-language models by proposing three targeted improvements including prompt expansion, auxiliary visual tasks and novel reward designs which demonstrate clear practical value.
In a previous conference, the reviewers had already raised concerns regarding related issues. However, compared to the previous conference, this paper has not addressed these issues, and the overall content remains consistent with the submission to the earlier conference. Therefore, the reviewers’ acknowledgment of the paper’s strengths and their concerns about its weaknesses are consistent with what was expressed in the previous conference: 1. Why does simply expanding the prompt lead to such s
# Strengths 1. Clear decomposition of failure modes (perception and reasoning) and mapping to concrete training knobs (prompt/context, localization transfer, reward shaping). The four components are easy to reproduce conceptually. 2. MFRS alleviates sparse rewards and gives notable gains over binary accuracy rewards on grading datasets. 3. Ablation shows the effectiveness of penalizing recitation, which can generalize better than rewarding it (positive), and is a non-obvious but actionable in
# Weaknesses 1. Evaluation mainly on small/classification datasets; limited open-ended medical reasoning. Many reported wins are on MedMNIST-style classification and a few fine-grained sets; these are simpler than full radiology VQA or report-generation and do not stress long-form reasoning or clinical justification as strongly as prior medical RL papers. The paper’s strongest novelty claims (recitation reward design/sign; localization-to-classification transfer) would be more compelling on har
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
