Reinforcing Multimodal Reasoning Against Visual Degradation
Rui Liu, Dian Yu, Haolin Liu, Yucheng Shi, Tong Zheng, Runpeng Dai, Haitao Mi, Pratap Tokekar, Leoweiliang

TL;DR
This paper introduces ROMA, a reinforcement learning framework that enhances multimodal reasoning models' robustness against visual degradations like blur and low resolution, without sacrificing performance on clean data.
Contribution
ROMA modifies RL fine-tuning dynamics with novel strategies to improve robustness against visual corruptions while maintaining accuracy on clean inputs.
Findings
ROMA improves robustness by +2.4% on seen corruptions.
ROMA improves robustness by +2.3% on unseen corruptions.
ROMA matches clean accuracy on multimodal reasoning benchmarks.
Abstract
Reinforcement Learning has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet the resulting policies remain brittle against real-world visual degradations such as blur, compression artifacts, and low-resolution scans. Prior robustness techniques from vision and deep RL rely on static data augmentation or value-based regularization, neither of which transfers cleanly to critic-free RL fine-tuning of autoregressive MLLMs. Reinforcing reasoning against such corruptions is non-trivial: naively injecting degraded views during rollout induces reward poisoning, where perceptual occlusions trigger hallucinated trajectories and destabilize optimization. We propose ROMA, an RL fine-tuning framework that modifies the optimization dynamics to reinforce reasoning against visual degradation while preserving clean-input performance. A dual-forward-pass…
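To make the degradations named in the abstract concrete, the following is a minimal sketch of two of them, blur and low resolution, applied to a tiny grayscale image. It is not ROMA's corruption pipeline, which the abstract does not specify. The function names `box_blur` and `low_res`, the nested-list image representation, and the default parameters are all assumptions made for illustration.

```python
from typing import List

# Hypothetical image type: grayscale, row-major, values in [0, 1].
Image = List[List[float]]


def box_blur(img: Image, radius: int = 1) -> Image:
    """Blur by averaging each pixel over a (2r+1) x (2r+1) window clipped at the borders."""
    h, w = len(img), len(img[0])
    out: Image = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [img[j][i] for j in ys for i in xs]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def low_res(img: Image, factor: int = 2) -> Image:
    """Simulate a low-resolution scan: average factor x factor blocks,
    then replicate each block mean back to the original size."""
    h, w = len(img), len(img[0])
    out: Image = [[0.0] * w for _ in range(h)]
    for by in range(0, h, factor):
        for bx in range(0, w, factor):
            ys = range(by, min(h, by + factor))
            xs = range(bx, min(w, bx + factor))
            mean = sum(img[j][i] for j in ys for i in xs) / (len(ys) * len(xs))
            for j in ys:
                for i in xs:
                    out[j][i] = mean
    return out


# A 4x4 image with a bright 2x2 square in the centre.
clean: Image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
blurred = box_blur(clean, radius=1)
coarse = low_res(clean, factor=2)
```

Both functions keep the image size unchanged, so a degraded view can stand in for the clean one wherever the model expects a fixed input shape. In this toy case the sharp edges of the square are smeared by `box_blur` and erased entirely by `low_res`, which is the kind of lost perceptual detail the abstract associates with hallucinated trajectories.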
