Reward Design for Physical Reasoning in Vision-Language Models
Derek Lilienthal, Manisha Mukherjee, and Sameera Horawalavithana

TL;DR
This paper systematically studies how different reward signals influence physical reasoning in vision-language models, and finds that the effect of reward design on reasoning behavior and performance differs across domains.
Contribution
It introduces a novel internal attention-based reward and provides a comprehensive ablation study on reward effects in physical reasoning tasks.
Findings
Accuracy-based rewards yield the strongest overall performance gains.
Rubric rewards enhance structured reasoning but do not always improve accuracy.
Attention-based rewards improve spatial reasoning without harming symbolic reasoning.
Abstract
Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision Language Models (VLMs) fall far short of human performance on physics benchmarks. While post-training algorithms such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) have demonstrated strong reasoning gains in language models, how reward design shapes VLM physical reasoning behavior remains poorly understood. We present a systematic reward ablation study for GRPO-based VLM training on physical reasoning. We compare four reward signals of increasing semantic richness: format compliance, answer accuracy, a composite rubric reward (answer correctness, physics principle identification, and unit consistency), and a novel internal reward derived from model attention weights over input image…
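To make the reward comparison concrete, here is a minimal sketch of how the first three reward signals could be scored and then turned into GRPO's group-relative advantages. It is an illustration under stated assumptions, not the authors' implementation: the `<think>`/`<answer>` tag format, the exact-match answer check, the rubric weights, and all function names are hypothetical, and the attention-based reward is omitted because the abstract does not specify how it is computed.

```python
import re
import statistics

# Assumed output format (hypothetical): <think>...</think><answer>...</answer>
FORMAT_PATTERN = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed tag format, else 0.0."""
    matched = re.fullmatch(FORMAT_PATTERN, completion, flags=re.DOTALL)
    return 1.0 if matched else 0.0


def accuracy_reward(completion: str, gold: str) -> float:
    """Return 1.0 if the extracted answer equals the gold answer (case-insensitive)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == gold.strip().lower() else 0.0


def rubric_reward(correctness: float, principle: float, units: float,
                  weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted sum of the three rubric components named in the abstract.

    Each component score is assumed to lie in [0, 1] and to come from an
    external grader; the weights are placeholders, not values from the paper.
    """
    w_c, w_p, w_u = weights
    return w_c * correctness + w_p * principle + w_u * units


def group_relative_advantages(rewards, eps: float = 1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    gold = "4 N"
    group = [
        "<think>F = ma = 2 * 2</think><answer>4 N</answer>",
        "<think>F = m / a</think><answer>1 N</answer>",
        "4 N",  # correct value, but no tags, so it scores 0 on both rewards
    ]
    rewards = [accuracy_reward(c, gold) for c in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the advantages are normalized within each group, a group in which every completion receives the same reward yields zero advantages in this sketch. That is one reason a coarse signal such as format compliance may stop providing learning signal once most samples satisfy it.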
