Difference Feedback: Generating Multimodal Process-Level Supervision for VLM Reinforcement Learning
Feiding, Yongkang Zhang, Yuhao Liao, Zijian Zeng, Chunzheng Zhu, Yaozong Zheng, Yafei Liu, Yeling Peng, Youwei Wang, Sibo Wang, Huiming Yang, Linglin Liao, and Shunzhi Yang

TL;DR
This paper introduces Differential Feedback, a method that improves vision-language models by repairing erroneous reasoning trajectories to generate process-level supervision, leading to better visual alignment and reasoning accuracy.
Contribution
It presents an automatic supervision technique that enhances multimodal reasoning without requiring extensive human annotations and that is compatible with existing GRPO-like training frameworks.
Findings
Achieved an average 3% improvement on the MMMStar and MathVista benchmarks under matched compute budgets.
Enabled process-level visual alignment without costly annotations.
Improved stability and reduced hallucinations in VLM reasoning.
Abstract
Vision-language models (VLMs) are increasingly aligned via Group Relative Policy Optimization (GRPO)-style training. However, relying solely on terminal outcome rewards yields sparse credit assignment in multi-step reasoning, weakening the linkage between visual evidence and intermediate steps and often causing unstable optimization and visual hallucinations. We propose Differential Feedback, which automatically constructs token/step-level supervision masks by repairing erroneous reasoning trajectories, explicitly marking the key positions that require correction. Without costly large-scale step-by-step human annotations, our method enables process-level visual alignment and can be seamlessly integrated into existing GRPO-like frameworks. Experiments on multimodal reasoning benchmarks including MMMStar and MathVista show an average 3% improvement under matched compute budgets. Our…
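The abstract does not spell out how the supervision masks are built, so the following is only a minimal sketch of the general idea: mark the tokens of an erroneous trajectory that a repaired version changes. The function name `difference_mask`, the whitespace tokenization, and the use of a plain sequence diff are all assumptions made for illustration, not the authors' procedure.

```python
from difflib import SequenceMatcher


def difference_mask(erroneous, repaired):
    """Return a 0/1 mask over `erroneous` tokens.

    A 1 marks a token that the repaired trajectory replaced or deleted.
    Hypothetical helper: the paper's actual mask construction may differ.
    """
    mask = [0] * len(erroneous)
    matcher = SequenceMatcher(a=erroneous, b=repaired, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "insert" spans have i1 == i2, so they mark nothing in `erroneous`.
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                mask[i] = 1
    return mask


# Toy example: the repair fixes only the final numeric answer.
erroneous = "the triangle has base 4 and height 3 so area is 12".split()
repaired = "the triangle has base 4 and height 3 so area is 6".split()
mask = difference_mask(erroneous, repaired)
print(list(zip(erroneous, mask)))
```

In a GRPO-like setup, a mask of this kind could be used to concentrate the per-token training signal on the marked positions. How the mask enters the objective is not stated in the abstract, so that step is left out here.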
