Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning
Jiaer Xia, Yuhang Zang, Peng Gao, Sharon Li, Kaiyang Zhou

TL;DR
Visionary-R1 demonstrates that training visual language models with reinforcement learning and detailed captioning can improve reasoning abilities and reduce shortcut learning, outperforming existing models on visual reasoning tasks.
Contribution
The paper introduces Visionary-R1, a novel approach that trains VLMs with reinforcement learning and caption-based reasoning to mitigate shortcuts and enhance generalization.
Findings
Visionary-R1 outperforms GPT-4o, Claude3.5-Sonnet, and Gemini-1.5-Pro on visual reasoning benchmarks.
Training with caption-reason-answer format reduces shortcut learning in VLMs.
Reinforcement learning with detailed captioning improves reasoning capabilities.
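The caption-reason-answer format in the findings above can be illustrated with a small format check. This is only a sketch under stated assumptions: the tag names `<caption>`, `<think>`, and `<answer>` and the exact-match scoring are hypothetical choices made for illustration, not the paper's actual prompt template or reward.

```python
import re

# Hypothetical tag names: this page does not specify the real template.
_PATTERN = re.compile(
    r"\s*<caption>(?P<caption>.+?)</caption>\s*"
    r"<think>(?P<reason>.+?)</think>\s*"
    r"<answer>(?P<answer>.+?)</answer>\s*",
    re.DOTALL,
)


def parse_response(text):
    """Split a response into caption, reasoning, and answer, or return None."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


def format_reward(text):
    """1.0 if all three sections appear in caption-reason-answer order."""
    return 1.0 if parse_response(text) is not None else 0.0


def accuracy_reward(text, gold):
    """1.0 if the extracted answer matches the reference (case-insensitive)."""
    parsed = parse_response(text)
    if parsed is None:
        return 0.0
    return 1.0 if parsed["answer"].lower() == gold.strip().lower() else 0.0
```

A response that jumps straight to the answer fails the format check, which is the behavior the caption-first structure is meant to discourage.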
Abstract
Learning general-purpose reasoning capabilities has long been a challenging problem in AI. Recent research in large language models (LLMs), such as DeepSeek-R1, has shown that reinforcement learning techniques like GRPO can enable pre-trained LLMs to develop reasoning capabilities using simple question-answer pairs. In this paper, we aim to train visual language models (VLMs) to perform reasoning on image data through reinforcement learning and visual question-answer pairs, without any explicit chain-of-thought (CoT) supervision. Our findings indicate that simply applying reinforcement learning to a VLM -- by prompting the model to produce a reasoning chain before providing an answer -- can lead the model to develop shortcuts from easy questions, thereby reducing its ability to generalize across unseen data distributions. We argue that the key to mitigating shortcut learning is to…
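As a rough illustration of why GRPO needs only question-answer pairs: it samples several responses per question, scores each one, and normalizes the scores within the group, so no learned value function is involved. The sketch below shows only that normalization step. The function name, the `eps` term, and the use of the population standard deviation are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one question's sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    eps is an assumed guard against division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one question, two of them correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct responses get roughly +1, incorrect ones roughly -1.
```

When every response in a group receives the same reward, all advantages are zero and the group carries no learning signal.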
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1) Visionary-R1 introduces a conceptually simple but powerful “caption-before-reason” strategy that forces the model to understand the image context before reasoning. 2) Unlike prior VLMs requiring large-scale GPT-4-generated CoT supervision, Visionary-R1 is trained purely from question–answer pairs, significantly improving scalability and autonomy. 3) The inclusion of an auxiliary caption reward explicitly reduces the tendency to rely on superficial visual cues, encouraging deeper, generalizabl
1) Because the R1 training process lacks explicit CoT supervision, it is uncertain whether the observed problems stem solely from shortcut learning. Other possible causes, such as unstable reward optimization or limited exploration, may also contribute. The paper should analyze these alternatives more carefully or justify why shortcut bias is the only plausible explanation. 2) In Figure 1, the paper mentions “shortcut” but does not clearly define what it refers to. In this paper, it seems relate
1. The paper clearly identifies a practical failure mode, shortcut learning in visual RL, and provides strong empirical evidence of this phenomenon. 2. The caption-reason-answer structure and caption reward are conceptually simple but yield measurable generalization improvements. 3. The methodology is well-detailed, including architecture, rewards, and prompt templates. The authors commit to releasing code and models.
1. The caption reward relies on the model’s own LLM component to verify if the caption enables correct answering — this could lead to reward leakage or self-confirmation bias. There’s no analysis of how often the reward misfires. 2. While outperforming on MathVista/MMBench, the improvement on some datasets (e.g., MathVision, MMStar) is modest, suggesting the gains may stem from formatting or stylistic changes rather than deeper reasoning. 3. Longer outputs correlate with accuracy, but this metri
1. By enforcing the “Describe–Reason–Answer” output format, the model must first produce a detailed image description before reasoning. This clever design ensures the model deeply understands image content rather than relying on superficial patterns. 2. The dataset is broad, the evaluation benchmarks are comprehensive, and the paper provides extensive ablation studies and hyperparameter analyses, which strengthen the credibility of the conclusions.
1. The proposed approach still relies on a VLM to generate reasoning chains during training. How is this fundamentally different from methods that extract reasoning chains from proprietary models such as GPT-4o? 2. The discussion of related work is incomplete — the paper does not adequately cover other recent approaches to avoiding shortcut learning (e.g., causal intervention or disentangled representation learning) and lacks a thorough comparison with similar “intermediate supervision” methods
1. Incorporating a caption-based reward into GRPO is an interesting extension of GRPO for multimodal reasoning. 2. The paper is clearly written and easy to follow.
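The caption-based reward mentioned here can be sketched from how the reviews describe it: a text-only answerer is given the caption in place of the image, and the caption is rewarded if the answer can still be recovered. Everything below is a hypothetical illustration. The `answer_from_text` callable stands in for the model's LLM component, and the exact-match comparison and the additive `caption_weight` combination are assumptions, not the paper's formulation.

```python
def caption_reward(caption, question, gold, answer_from_text):
    """1.0 if a text-only answerer recovers the gold answer from the caption.

    answer_from_text is a hypothetical callable (caption, question) -> str.
    """
    predicted = answer_from_text(caption, question)
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def total_reward(accuracy, fmt, caption, caption_weight=1.0):
    """Assumed additive combination of the per-response reward terms."""
    return accuracy + fmt + caption_weight * caption


# Stub answerer for illustration only: it answers "B" when the caption
# states that B is taller and otherwise says it cannot tell.
def stub_answerer(caption, question):
    return "B" if "B is taller" in caption else "unknown"


detailed = caption_reward("Two bars; B is taller than A.", "Which bar is taller?", "B", stub_answerer)
vague = caption_reward("A chart.", "Which bar is taller?", "B", stub_answerer)
```

A vague caption earns no caption reward even when the final answer is right, which is the pressure toward detailed descriptions the reviews refer to.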
[Major Weakness] * The motivation for tackling shortcut learning is not fully convincing. This issue might be mitigated by simply modifying the training data (e.g., filtering easy samples or injecting more complex QA pairs) without altering the training objective. * The novelty is limited. Prior works such as [1] and [2] have already demonstrated that forcing a model to caption before answering improves performance. As a result, it seems incremental because this work mainly enforces captio
1. The Visionary-R1 approach is simple, easy to scale, and uses end-to-end reinforcement learning. This makes it cheaper to train: it needs only QA pairs, whereas some of the baselines use SFT and therefore need to obtain CoT traces. 2. The paper provides insightful ways to train VLMs and ablates training-recipe choices such as the caption reward and the annealed KL coefficient well, which can be useful for other practitioners training VLMs with RL. 3. Their 3B fine-tuned model bea
1. While the authors argue that training with GRPO directly leads to a shortcut reasoning issue, the paper lacks detailed quantitative support for this claim. It would be helpful if the authors could provide some metrics, for example solution length over time when training with GRPO. Figure 5 does include some plots, but because the KL coefficient is also varied and the caption reward is used, it is hard to draw a conclusion. In addition to training metrics, it would be interesting to check this if the a
Taxonomy
Topics: Reinforcement Learning in Robotics · EEG and Brain-Computer Interfaces · Neural and Behavioral Psychology Studies
