KEPO: Knowledge-Enhanced Preference Optimization for Multimodal Reasoning with Applications to Medical VQA
Fan Yang, Rui Meng, Trudi Di Qi, Ali Ezzati, Yuxin Wen

TL;DR
KEPO is a novel reinforcement learning framework that improves reasoning in multimodal models by selectively distilling high-quality trajectories and leveraging knowledge-based exploration, leading to better medical VQA performance.
Contribution
KEPO introduces a unified post-training method that combines quality-gated distillation with knowledge-enhanced exploration for reasoning tasks.
Findings
KEPO improves training stability in medical VQA.
KEPO achieves more coherent reasoning behaviors.
KEPO outperforms baseline methods in out-of-distribution tests.
Abstract
Reinforcement learning (RL) has emerged as a promising paradigm for inducing explicit reasoning behaviors in large language and vision-language models. However, reasoning-oriented RL post-training remains fundamentally challenging due to sparse trajectory-level rewards, leading to ambiguous credit assignment and severe exploration failures that can trap the policy in a "learning cliff." Recent on-policy distillation methods introduce dense teacher supervision to stabilize optimization, but apply it uniformly across all generated trajectories. We argue that such uniform distillation is ill-suited for reasoning-intensive tasks, as low-quality on-policy trajectories often originate from early logical errors, and distillation under flawed contexts injects noisy and misaligned gradients. To address these challenges, we propose Knowledge-Enhanced Preference Optimization (KEPO), a unified…
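The contrast the abstract draws between uniform and quality-gated distillation can be illustrated with a small sketch. The abstract is truncated here and does not say how KEPO defines its gate, so everything below is an assumption made for illustration: a scalar reward per trajectory, a fixed hypothetical threshold, precomputed per-token teacher-student divergences, and the function names themselves. It is not the authors' implementation.

```python
from typing import Sequence


def uniform_distillation_loss(token_divs: Sequence[Sequence[float]]) -> float:
    """Average per-token divergence over ALL trajectories (uniform distillation)."""
    total = sum(sum(divs) for divs in token_divs)
    count = sum(len(divs) for divs in token_divs)
    return total / count if count else 0.0


def gated_distillation_loss(
    token_divs: Sequence[Sequence[float]],
    rewards: Sequence[float],
    threshold: float = 0.5,  # hypothetical gate; not taken from the paper
) -> float:
    """Average per-token divergence over trajectories whose reward passes the gate.

    Trajectories below the threshold contribute no distillation signal, so a
    rollout that went wrong early does not pull the student toward teacher
    predictions conditioned on a flawed context.
    """
    kept = [divs for divs, r in zip(token_divs, rewards) if r >= threshold]
    return uniform_distillation_loss(kept)


# Two rollouts: the first is correct (reward 1.0), the second is not (reward 0.0).
token_divs = [[0.2, 0.4], [1.0, 3.0, 2.0]]
rewards = [1.0, 0.0]

uniform = uniform_distillation_loss(token_divs)
gated = gated_distillation_loss(token_divs, rewards)
print(f"uniform={uniform:.2f} gated={gated:.2f}")
```

In this toy case the failed rollout dominates the uniform average, while the gated version uses only the two tokens of the successful rollout. A real system would compute the divergences from teacher and student logits and would need some policy for batches in which no trajectory passes the gate; this sketch simply returns 0.0 there.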
