Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum
Shan Ning, Longtian Qiu, Xuming He

TL;DR
Wiki-R1 introduces a curriculum reinforcement learning framework that enhances multimodal large language models' reasoning for knowledge-based visual question answering by systematically controlling training data difficulty.
Contribution
It proposes a novel curriculum learning approach with controllable data generation and sampling strategies to improve KB-VQA performance of MLLMs.
Findings
Achieves state-of-the-art accuracy on Encyclopedic VQA and InfoSeek benchmarks.
Effectively bridges the gap between pretraining and KB-VQA tasks.
Demonstrates significant performance improvements over previous methods.
Abstract
Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. In this work, we propose Wiki-R1, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model's evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce controllable curriculum data generation, which manipulates the retriever to produce…
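The abstract is cut off before it explains how the retriever is manipulated. The reviews on this page describe the controls as the number of retrieved candidates and whether the ground-truth entry is included. The following is a minimal sketch under that reading only; the function `build_context`, its parameters, and the replace-the-last-candidate rule are hypothetical illustrations, not the authors' implementation.

```python
import random

def build_context(retrieved, gold, num_candidates, force_gold, rng=None):
    """Assemble a candidate list whose difficulty is set by two knobs.

    retrieved      -- ranked list of knowledge-base entries from some retriever
    gold           -- the ground-truth entry for this question
    num_candidates -- how many entries the model sees (more = noisier)
    force_gold     -- if True, guarantee that the gold entry is present (easier)

    All names and rules here are assumptions made for illustration.
    """
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    rng = rng or random.Random(0)
    top_k = list(retrieved[:num_candidates])
    if force_gold:
        if gold not in top_k:
            if top_k:
                # Overwrite the lowest-ranked candidate so the length stays fixed.
                top_k[-1] = gold
            else:
                top_k = [gold]
        # Shuffle so the gold entry's position is not a giveaway.
        rng.shuffle(top_k)
    return top_k

# Easy stage: a single candidate that is guaranteed to be the gold entry.
easy = build_context(["a", "b", "c", "d"], "z", num_candidates=1, force_gold=True)
# Hard stage: the retriever's raw top-3, with no guarantee the gold entry is there.
hard = build_context(["a", "b", "c", "d"], "z", num_candidates=3, force_gold=False)
```

A curriculum in this sketch would move from the easy setting toward the hard one as the model's accuracy improves; the schedule itself is not specified in the text available here.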
Peer Reviews
Decision: ICLR 2026 Poster
- The paper clearly identifies why RL fails in KB-VQA (retrieval noise leads to sparse rewards) and proposes a reasonable data-centric solution to increase reward density and improve downstream reasoning performance.
- The core idea is elegant: generating progressive difficulty data through controllable parameters and selecting the most valuable samples using accuracy-based Gaussian sampling to balance between "already learned" and "not yet learned" examples. I particularly appreciate the Observ
- Some key parameters are not specified, including the exact definition of the reward function in the RL objective, and the observation propagation parameters.
- The reliance on TF-IDF similarity may restrict the method to surface-level lexical matching, potentially missing semantically similar samples that use different vocabulary.
- The paper lacks details needed to assess the true benefit of RL. Specifically: What retrieval configuration was used for the SFT baseline? Does it use the same sam
1. Targeted Problem: Addresses a critical pain point of distribution gap and sparse rewards in RL-based KB-VQA.
2. Generalization: Excels on unseen question splits (e.g., InfoSeek Unseen-Q: 47.8% vs. prior 40.4%), indicating robust reasoning.
3. Component Validity: Ablations clearly show that the data curriculum improves DAPO performance and that propagation is necessary for the sampling curriculum to work.
1. Limited Retrieval Control: The data generation relies on adjusting retrieval noise (number of candidates, ground-truth inclusion) but does not fully control the type of noise (e.g., irrelevant vs. slightly relevant candidates).
2. Hyperparameter Transparency: No sensitivity analysis for key hyperparameters (e.g., curriculum gap threshold τ, observation propagation smoothing factor α).
3. RL Algorithm Scope: Only uses DAPO as the base RL algorithm, with no comparison with other RL methods (e.g., PPO
1. The framework introduces an elegant combination of data-level and sampling-level curricula. The idea of controlling retrieval difficulty rather than merely selecting data is innovative and well-motivated by the sparse reward challenge in KB-VQA.
2. The approach yields notable accuracy gains with only ~40k training samples, far fewer than prior methods requiring millions, highlighting efficiency and scalability.
1. While the combination of controllable retrieval and curriculum sampling is well-engineered, the theoretical novelty may be seen as incremental over prior curriculum RL works. The core mechanism (progressively harder data and adaptive sampling) is conceptually similar.
2. The performance gains are mostly demonstrated on EVQA and InfoSeek. It's unclear whether the framework generalizes to other KB-VQA settings (e.g., OK-VQA) or to different retrieval model architectures.
3. The literature revie
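The reviews above mention accuracy-based Gaussian sampling and an observation propagation step with a smoothing factor α, but note that the exact parameters are not given. The sketch below shows one plausible shape for both, purely as an assumption: the Gaussian centre `mu`, width `sigma`, the update rule in `propagate`, and every function name are hypothetical, and the similarity scores are taken as precomputed inputs (for example from TF-IDF cosine similarity).

```python
import math
import random

def gaussian_weights(acc, mu=0.5, sigma=0.2):
    """Sampling probabilities that peak where estimated accuracy is near mu.

    Samples the model always solves (acc near 1) or never solves (acc near 0)
    get low weight. mu and sigma are assumed values, not from the paper.
    """
    raw = [math.exp(-((a - mu) ** 2) / (2 * sigma ** 2)) for a in acc]
    total = sum(raw)
    return [r / total for r in raw]

def propagate(acc, sim_row, observed, alpha=0.5):
    """Pull each estimate toward an observed accuracy, scaled by similarity.

    sim_row[j] in [0, 1] is the similarity between the observed sample and
    sample j. This update rule is an assumed form of observation propagation.
    """
    return [(1 - alpha * s) * a + alpha * s * observed
            for a, s in zip(acc, sim_row)]

def sample_batch(acc, batch_size, rng):
    """Draw sample indices with replacement according to the Gaussian weights."""
    return rng.choices(range(len(acc)), weights=gaussian_weights(acc), k=batch_size)

estimates = [0.0, 0.5, 1.0]
weights = gaussian_weights(estimates)
updated = propagate([0.2, 0.8], sim_row=[1.0, 0.0], observed=1.0, alpha=0.5)
batch = sample_batch(estimates, batch_size=4, rng=random.Random(0))
```

In this sketch, propagation lets a single rollout inform the accuracy estimates of similar, not-yet-sampled questions, which is one way a sampler could stay informed when most samples have never been observed.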
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
