Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning
Chaoyang Wang, Zeyu Zhang, Meng Meng, Xu Zhou, Haiyun Jiang

TL;DR
Vision-EKIPL introduces an external knowledge-infused reinforcement learning framework that enhances visual reasoning in multimodal models by guiding policy learning with auxiliary model-generated actions, leading to improved performance and training efficiency.
Contribution
The paper proposes a novel RL framework that incorporates actions from external auxiliary models to expand exploration and improve reasoning in visual question answering tasks.
Findings
Achieves up to a 5% performance improvement on the Reason-RFT-CoT benchmark.
Significantly accelerates training convergence and improves training efficiency.
Overcomes limitations of traditional RL methods in visual reasoning.
Abstract
Visual reasoning is crucial for understanding complex multimodal data and advancing Artificial General Intelligence. Existing methods enhance the reasoning capability of Multimodal Large Language Models (MLLMs) through Reinforcement Learning (RL) fine-tuning (e.g., GRPO). However, current RL approaches sample action groups solely from the policy model itself, which limits the upper boundary of the model's reasoning capability and leads to inefficient training. To address these limitations, this paper proposes a novel RL framework called Vision-EKIPL. The core of this framework lies in introducing high-quality actions generated by external auxiliary models during the RL training process to guide the optimization of the policy model. The policy learning with knowledge infusion from external models significantly expands the model's exploration space, effectively improves the…
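To make the idea in the abstract concrete, here is a minimal sketch of how a GRPO-style group could mix policy-sampled actions with actions from an external auxiliary model. It is an illustration under stated assumptions, not the paper's implementation: the function names (`build_mixed_group`, `toy_reward`), the exact-match reward, and the choice to simply pool all candidates into one group are hypothetical. The only part taken from standard GRPO is the group-relative advantage, which standardizes each reward against the mean and standard deviation of its group.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def build_mixed_group(policy_actions, auxiliary_actions):
    """Hypothetical pooling step: tag each candidate with its source.

    How Vision-EKIPL actually selects or filters candidates is not
    described in the text above; plain pooling is an assumption.
    """
    tagged_policy = [(a, "policy") for a in policy_actions]
    tagged_aux = [(a, "auxiliary") for a in auxiliary_actions]
    return tagged_policy + tagged_aux


def toy_reward(action, target="4"):
    """Toy verifiable reward (assumption): 1.0 on exact match, else 0.0."""
    return 1.0 if action.strip() == target else 0.0


# Policy-only group: every sampled answer is wrong, so all rewards are
# equal and every advantage is zero (no learning signal).
policy_actions = ["3", "5", "3"]
policy_rewards = [toy_reward(a) for a in policy_actions]
policy_only_adv = group_relative_advantages(policy_rewards)

# Mixed group: one action from an auxiliary model is correct, which
# creates reward variance and therefore non-zero advantages.
auxiliary_actions = ["4"]
group = build_mixed_group(policy_actions, auxiliary_actions)
rewards = [toy_reward(a) for a, _ in group]
advantages = group_relative_advantages(rewards)

for (action, source), adv in zip(group, advantages):
    print(f"{source:9s} action={action!r} advantage={adv:+.3f}")
```

In this toy setting the policy-only group yields zero advantages because all rewards are identical, while adding a single correct auxiliary action gives that action a positive advantage and the incorrect ones a negative advantage. This is one way to read the abstract's claim that sampling solely from the policy limits exploration.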
