Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning

Chaoyang Wang; Zeyu Zhang; Meng Meng; Xu Zhou; Haiyun Jiang

arXiv:2506.06856·cs.CV·May 7, 2026

TL;DR

Vision-EKIPL introduces an external knowledge-infused reinforcement learning framework that enhances visual reasoning in multimodal models by guiding policy learning with auxiliary model-generated actions, leading to improved performance and training efficiency.

Contribution

It proposes a novel RL framework that incorporates external auxiliary model actions to expand exploration and improve reasoning in visual question answering tasks.

Findings

01

Achieves a performance improvement of up to 5% on the Reason-RFT-CoT benchmark.

02

Significantly accelerates training convergence and improves training efficiency.

03

Overcomes a limitation of traditional RL methods in visual reasoning, which sample action groups only from the policy model itself.

Abstract

Visual reasoning is crucial for understanding complex multimodal data and advancing Artificial General Intelligence. Existing methods enhance the reasoning capability of Multimodal Large Language Models (MLLMs) through Reinforcement Learning (RL) fine-tuning (e.g., GRPO). However, current RL approaches sample action groups solely from the policy model itself, which limits the upper boundary of the model's reasoning capability and leads to inefficient training. To address these limitations, this paper proposes a novel RL framework called Vision-EKIPL. The core of this framework lies in introducing high-quality actions generated by external auxiliary models during the RL training process to guide the optimization of the policy model. The policy learning with knowledge infusion from external models significantly expands the model's exploration space, effectively improves the…
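The idea of mixing externally generated actions into a GRPO-style action group can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not say how external actions are selected, filtered, or weighted, so the merge step below is an assumption, and the names `build_mixed_group`, `exact_match_reward`, and `group_relative_advantages` are hypothetical. The sketch only shows the standard group-relative advantage (reward minus group mean, divided by group standard deviation) applied to a group that contains both policy samples and actions from an auxiliary model.

```python
import statistics
from typing import List, Sequence, Tuple


def build_mixed_group(
    policy_actions: Sequence[str], external_actions: Sequence[str]
) -> List[Tuple[str, str]]:
    """Merge policy samples and auxiliary-model samples into one group.

    Assumption: a plain concatenation. The source does not describe the
    actual selection or mixing rule.
    """
    return [(a, "policy") for a in policy_actions] + [
        (a, "external") for a in external_actions
    ]


def exact_match_reward(action: str, answer: str) -> float:
    """Toy reward (hypothetical): 1.0 if the answer matches exactly, else 0.0."""
    return 1.0 if action.strip() == answer.strip() else 0.0


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """GRPO-style advantage: normalize each reward within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    answer = "7"
    policy_only = ["3", "5", "4"]  # the policy never finds the answer
    mixed = build_mixed_group(policy_only, ["7"])  # one external action

    r_policy = [exact_match_reward(a, answer) for a in policy_only]
    r_mixed = [exact_match_reward(a, answer) for a, _ in mixed]

    print("policy-only advantages:", group_relative_advantages(r_policy))
    print("mixed-group advantages:", group_relative_advantages(r_mixed))
```

In this toy setting, a group sampled only from the policy receives identical rewards, so every normalized advantage is zero and the group contributes no gradient signal. Adding one rewarded external action makes the rewards differ, which gives the external action a positive advantage and the failed policy samples negative ones. This matches the abstract's stated motivation of expanding the exploration space, under the assumptions noted above.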
