TL;DR
NoisyGRPO is a multimodal reinforcement learning framework that combines noise injection on visual inputs with Bayesian advantage estimation to improve the reasoning and robustness of multimodal large language models across visual scenarios.
Contribution
It presents an RL method that combines noise-injected exploration with Bayesian advantage estimation so that multimodal reasoning generalizes better beyond the training distribution.
Findings
Significant improvement in reasoning quality and robustness on standard benchmarks.
Enhanced generalization in small-scale multimodal models such as Qwen2.5-VL 3B.
Better handling of noisy visual inputs and fewer hallucinations.
Abstract
Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a…
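The noise-injected exploration step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the uniform sampling of the noise level, the `max_noise` bound, and the toy flattened image are all assumptions made for illustration. The Bayesian advantage estimation step is deliberately left out, because the abstract is truncated before it is fully described.

```python
import random


def inject_gaussian_noise(pixels, noise_level, rng):
    """Return a copy of `pixels` (floats in [0, 1]) with additive Gaussian noise.

    `noise_level` is the standard deviation of the noise; 0.0 leaves the
    image unchanged. Results are clipped back into [0, 1].
    """
    if noise_level < 0:
        raise ValueError("noise_level must be non-negative")
    return [min(1.0, max(0.0, p + rng.gauss(0.0, noise_level))) for p in pixels]


def sample_noisy_group(pixels, group_size, max_noise, seed=0):
    """Build one rollout group in which each member sees a differently perturbed image.

    Assumption: noise levels are drawn uniformly from [0, max_noise]; the
    paper may use a different schedule.
    """
    rng = random.Random(seed)
    group = []
    for _ in range(group_size):
        level = rng.uniform(0.0, max_noise)  # hypothetical sampling scheme
        group.append((level, inject_gaussian_noise(pixels, level, rng)))
    return group


# Toy flattened grayscale image standing in for a real visual input.
image = [0.0, 0.25, 0.5, 0.75, 1.0]
group = sample_noisy_group(image, group_size=4, max_noise=0.3)
for level, noisy in group:
    print(f"noise_level={level:.3f} pixels={[round(p, 3) for p in noisy]}")
```

Keeping the sampled noise level alongside each perturbed image matters here, since the abstract states that the injected noise level is later used in advantage estimation.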
