NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

Longtian Qiu; Shan Ning; Jiaxuan Sun; Xuming He

arXiv:2510.21122 · cs.CV · April 8, 2026

TL;DR

NoisyGRPO is a multimodal reinforcement learning framework that injects controllable noise into visual inputs and uses Bayesian advantage estimation to improve the reasoning and robustness of multimodal large language models.

Contribution

The paper presents an RL method that combines noise-injected exploration with Bayesian advantage estimation to improve the generalization of multimodal CoT reasoning.

Findings

01 Significant improvement in reasoning quality and robustness on standard benchmarks.

02 Enhanced generalization in small-scale multimodal models such as Qwen2.5-VL 3B.

03 Better handling of noisy visual inputs and reduced hallucination.

Abstract

Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a…
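The noise-injected exploration step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the additive pixel-space noise, the uniform sampling of the noise level, and the clipping to [0, 1] are all assumptions made for illustration. The image is a plain nested list of pixel intensities so the example needs only the standard library. The Bayesian advantage estimation step is not sketched, because the abstract text is truncated before it is fully specified.

```python
import random


def inject_gaussian_noise(image, noise_level, rng):
    """Return a perturbed copy of `image` (rows of pixel values in [0, 1]).

    Assumption: noise is additive, zero-mean Gaussian with standard
    deviation `noise_level`, and results are clipped back to [0, 1].
    """
    if noise_level < 0:
        raise ValueError("noise_level must be non-negative")
    return [
        [min(1.0, max(0.0, pixel + rng.gauss(0.0, noise_level))) for pixel in row]
        for row in image
    ]


def sample_noisy_group(image, group_size, max_noise, seed=0):
    """Build one rollout group: each member sees the image at its own noise level.

    Assumption: noise levels are drawn uniformly from [0, max_noise].
    Returns a list of (noise_level, noisy_image) pairs; the noise level is
    kept alongside the image so a later step could use it when scoring.
    """
    rng = random.Random(seed)
    group = []
    for _ in range(group_size):
        level = rng.uniform(0.0, max_noise)
        group.append((level, inject_gaussian_noise(image, level, rng)))
    return group


if __name__ == "__main__":
    toy_image = [[0.5, 0.5], [0.5, 0.5]]  # hypothetical 2x2 grayscale image
    for level, noisy in sample_noisy_group(toy_image, group_size=4, max_noise=0.3):
        print(f"noise level {level:.3f}: {noisy}")
```

Keeping the sampled noise level next to each perturbed image reflects the abstract's statement that the injected noise level is an input to advantage estimation; how it enters that estimate is left open here.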

Videos

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation (SlidesLive)