Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum

Shan Ning; Longtian Qiu; Xuming He

arXiv:2603.05256·cs.CV·March 6, 2026

Open Access · 3 Reviews

TL;DR

Wiki-R1 introduces a curriculum reinforcement learning framework that enhances multimodal large language models' reasoning for knowledge-based visual question answering by systematically controlling training data difficulty.

Contribution

It proposes a novel curriculum learning approach with controllable data generation and sampling strategies to improve KB-VQA performance of MLLMs.

Findings

01. Achieves state-of-the-art accuracy on Encyclopedic VQA and InfoSeek benchmarks.
02. Effectively bridges the gap between pretraining and KB-VQA tasks.
03. Demonstrates significant performance improvements over previous methods.

Abstract

Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. In this work, we propose Wiki-R1, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model's evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce controllable curriculum data generation, which manipulates the retriever to produce…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper clearly identifies why RL fails in KB-VQA (retrieval noise leads to sparse rewards) and proposes a reasonable data-centric solution to increase reward density and improve downstream reasoning performance.
- The core idea is elegant: generating progressive difficulty data through controllable parameters and selecting the most valuable samples using accuracy-based Gaussian sampling to balance between "already learned" and "not yet learned" examples. I particularly appreciate the Observ
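As a rough illustration of the accuracy-based Gaussian sampling the reviewer describes, the sketch below weights each training sample by a Gaussian centred on a target accuracy, so samples the model solves about half the time are drawn most often. The function names, the target of 0.5, and the width `sigma` are assumptions made for this example; the page does not give the paper's actual formulation.

```python
import math
import random


def gaussian_sampling_weights(accuracies, target=0.5, sigma=0.2):
    """Return normalized weights that peak where accuracy equals `target`.

    `target` and `sigma` are illustrative values, not taken from the paper.
    """
    raw = [math.exp(-((a - target) ** 2) / (2 * sigma ** 2)) for a in accuracies]
    total = sum(raw)
    return [r / total for r in raw]


def sample_batch(samples, accuracies, k, rng=None):
    """Draw k samples (with replacement) according to the Gaussian weights."""
    if rng is None:
        rng = random.Random(0)  # fixed seed keeps the example reproducible
    weights = gaussian_sampling_weights(accuracies)
    return rng.choices(samples, weights=weights, k=k)


# Three samples: never solved, solved half the time, always solved.
weights = gaussian_sampling_weights([0.0, 0.5, 1.0])
batch = sample_batch(["hard", "medium", "easy"], [0.0, 0.5, 1.0], k=5)
```

With these settings the middle sample receives the largest weight, while the fully learned and never-solved samples receive equal, much smaller weights.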

Weaknesses

- Some key parameters are not specified, including the exact definition of the reward function in the RL objective, and the observation propagation parameters.
- The reliance on TF-IDF similarity may restrict the method to surface-level lexical matching, potentially missing semantically similar samples that use different vocabulary.
- The paper lacks details needed to assess the true benefit of RL. Specifically: What retrieval configuration was used for the SFT baseline? Does it use the same sam
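The TF-IDF concern can be made concrete with a small sketch. It assumes, purely for illustration, that observed accuracies are propagated to unobserved samples as a TF-IDF-cosine-weighted average, with a fallback `prior` when nothing overlaps lexically. The propagation rule, the function names, and the example questions are hypothetical and are not the paper's definition.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict) per text, using whitespace tokens."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    doc_freq = Counter(term for d in docs for term in d)
    # Smoothed IDF so that terms present in every text keep a positive weight.
    idf = {term: math.log((1 + n) / (1 + c)) + 1.0 for term, c in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in d.items()} for d in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def propagate_accuracy(texts, observed, prior=0.5):
    """Estimate accuracy for unobserved texts from lexically similar observed ones.

    `observed` maps a text index to its measured accuracy. Texts with no
    lexical overlap with any observed text fall back to `prior` (an assumption).
    """
    vecs = tfidf_vectors(texts)
    estimates = {}
    for i, vec in enumerate(vecs):
        if i in observed:
            estimates[i] = observed[i]
            continue
        sims = {j: cosine(vec, vecs[j]) for j in observed}
        total = sum(sims.values())
        if total > 0:
            estimates[i] = sum(s * observed[j] for j, s in sims.items()) / total
        else:
            estimates[i] = prior
    return estimates


questions = [
    "who designed this building",
    "who designed this bridge",
    "what year was the castle built",
    "name its architect",
]
estimates = propagate_accuracy(questions, observed={0: 1.0, 2: 0.0})
```

Here the second question inherits the accuracy of the first because they share most tokens, while "name its architect" falls back to the prior even though it asks nearly the same thing as the first question. That is the failure mode the reviewer points to.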

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Targeted Problem: Addresses a critical pain point: the distribution gap and sparse rewards in RL-based KB-VQA.
2. Generalization: Excels on unseen-question splits (e.g., InfoSeek Unseen-Q: 47.8% vs. the prior 40.4%), indicating robust reasoning.
3. Component Validity: Ablations clearly show that the data curriculum improves DAPO performance and that propagation is necessary for the sampling curriculum to work.

Weaknesses

1. Limited Retrieval Control: The data generation relies on adjusting retrieval noise (number of candidates, ground-truth inclusion) but does not fully control the type of noise (e.g., irrelevant vs. slightly relevant candidates).
2. Hyperparameter Transparency: No sensitivity analysis for key hyperparameters (e.g., curriculum gap threshold τ, observation propagation smoothing factor α).
3. RL Algorithm Scope: Only uses DAPO as the base RL algorithm; no comparison with other RL methods (e.g., PPO
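The two retrieval knobs named in the first point (number of candidates and ground-truth inclusion) can be sketched as follows. This is a minimal illustration under stated assumptions: `build_candidates` and the `LEVELS` schedule are hypothetical names and values invented for the example, and distractors are drawn uniformly at random, which is exactly the untyped noise the reviewer flags.

```python
import random


def build_candidates(gt_passage, distractors, num_candidates, include_gt, rng=None):
    """Assemble a retrieval context of `num_candidates` passages.

    Difficulty rises with more candidates and with the ground-truth passage
    left out. Distractors are sampled uniformly (an assumption), so the
    *type* of noise is not controlled.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed keeps the example reproducible
    n_noise = num_candidates - 1 if include_gt else num_candidates
    if n_noise < 0 or n_noise > len(distractors):
        raise ValueError("num_candidates is incompatible with the distractor pool")
    candidates = rng.sample(distractors, n_noise)
    if include_gt:
        candidates.append(gt_passage)
    rng.shuffle(candidates)  # avoid leaking the ground truth through its position
    return candidates


# Hypothetical difficulty levels, easiest first; the values are illustrative only.
LEVELS = [
    {"num_candidates": 1, "include_gt": True},
    {"num_candidates": 3, "include_gt": True},
    {"num_candidates": 5, "include_gt": True},
    {"num_candidates": 5, "include_gt": False},
]

pool = ["d1", "d2", "d3", "d4", "d5"]
contexts = [build_candidates("gt", pool, **level) for level in LEVELS]
```

Controlling the type of noise would require replacing the uniform draw with, for example, a similarity-ranked selection of distractors, which this sketch does not attempt.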

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The framework introduces an elegant combination of data-level and sampling-level curricula. The idea of controlling retrieval difficulty rather than merely selecting data is innovative and well-motivated by the sparse reward challenge in KB-VQA.
2. The approach yields notable accuracy gains with only ~40k training samples, far fewer than the millions required by prior methods, highlighting efficiency and scalability.

Weaknesses

1. While the combination of controllable retrieval and curriculum sampling is well-engineered, the theoretical novelty may be seen as incremental over prior curriculum RL works. The core mechanism (progressively harder data and adaptive sampling) is conceptually similar.
2. The performance gains are mostly demonstrated on EVQA and InfoSeek. It’s unclear whether the framework generalizes to other KB-VQA settings (e.g., OK-VQA) or to different retrieval model architectures.
3. The literature revie

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling