KEPO: Knowledge-Enhanced Preference Optimization for Multimodal Reasoning with Applications to Medical VQA

Fan Yang; Rui Meng; Trudi Di Qi; Ali Ezzati; Yuxin Wen

arXiv:2602.00400·cs.AI·May 13, 2026

TL;DR

KEPO is a reinforcement learning post-training framework that improves reasoning in multimodal models by distilling selectively from high-quality trajectories and using knowledge-enhanced exploration, which leads to better medical VQA performance.

Contribution

KEPO introduces a unified post-training method that combines quality-gated distillation with knowledge-enhanced exploration for reasoning tasks.

Findings

1. KEPO improves training stability in medical VQA.
2. KEPO achieves more coherent reasoning behaviors.
3. KEPO outperforms baseline methods in out-of-distribution tests.

Abstract

Reinforcement learning (RL) has emerged as a promising paradigm for inducing explicit reasoning behaviors in large language and vision-language models. However, reasoning-oriented RL post-training remains fundamentally challenging due to sparse trajectory-level rewards, leading to ambiguous credit assignment and severe exploration failures that can trap the policy in a "learning cliff." Recent on-policy distillation methods introduce dense teacher supervision to stabilize optimization, but apply it uniformly across all generated trajectories. We argue that such uniform distillation is ill-suited for reasoning-intensive tasks, as low-quality on-policy trajectories often originate from early logical errors, and distillation under flawed contexts injects noisy and misaligned gradients. To address these challenges, we propose Knowledge-Enhanced Preference Optimization (KEPO), a unified…
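The contrast the abstract draws between uniform and quality-gated distillation can be illustrated with a toy calculation. The sketch below is not the paper's implementation: the function names, the use of a scalar per-trajectory quality score, and the fixed threshold are all assumptions made for illustration, since the abstract does not specify how the gate is defined.

```python
from typing import Sequence


def uniform_distill_loss(distill_losses: Sequence[float]) -> float:
    """Average the teacher-distillation loss over every sampled trajectory."""
    if not distill_losses:
        return 0.0
    return sum(distill_losses) / len(distill_losses)


def gated_distill_loss(
    distill_losses: Sequence[float],
    quality_scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical gate value, not from the paper
) -> float:
    """Average the distillation loss only over trajectories that pass the gate.

    Trajectories whose quality score falls below `threshold` contribute
    nothing, so their (possibly noisy) teacher signal is ignored.
    Returns 0.0 when no trajectory passes.
    """
    if len(distill_losses) != len(quality_scores):
        raise ValueError("one quality score is required per trajectory")
    kept = [
        loss
        for loss, score in zip(distill_losses, quality_scores)
        if score >= threshold
    ]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# Four sampled trajectories: two low-quality, two high-quality (made-up numbers).
losses = [2.0, 0.5, 1.5, 1.0]
scores = [0.1, 0.9, 0.2, 0.8]

print(uniform_distill_loss(losses))        # 1.25
print(gated_distill_loss(losses, scores))  # 0.75
```

In this toy setting the two low-quality trajectories dominate the uniform average, while the gated average reflects only the trajectories judged reliable enough to distill from.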
