TL;DR
This paper introduces a listener-augmented reinforcement learning framework for vision-language models. An independent, frozen listener model re-evaluates the reasoner's reasoning traces, which improves alignment with human preferences, accuracy, and out-of-distribution performance.
Contribution
The paper proposes a novel listener-augmented GRPO method that enhances reward calibration and reasoning consistency in vision-language models.
Findings
Achieves 67.4% accuracy on the ImageReward benchmark.
Improves out-of-distribution performance (+6%).
Reduces reasoning contradictions compared to baselines.
Abstract
Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only…
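The reward shaping described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the additive combination, the `weight` parameter, and the function names `shaped_reward` and `group_relative_advantages` are assumptions made for illustration. The sketch assumes the listener returns a confidence in [0, 1] for the reasoner's answer after reading its chain-of-thought, and it uses the standard GRPO group normalization (reward minus group mean, divided by group standard deviation).

```python
from statistics import mean, pstdev


def shaped_reward(is_correct: bool, listener_confidence: float, weight: float = 0.5) -> float:
    """Combine a binary correctness reward with a listener confidence score.

    Hypothetical shaping rule: the additive form and `weight` are assumptions,
    not taken from the paper.
    """
    if not 0.0 <= listener_confidence <= 1.0:
        raise ValueError("listener_confidence must lie in [0, 1]")
    return float(is_correct) + weight * listener_confidence


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one sampled group, as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt: (answer correct?, listener confidence).
group = [(True, 0.9), (True, 0.4), (False, 0.2), (False, 0.0)]
rewards = [shaped_reward(ok, conf) for ok, conf in group]
advantages = group_relative_advantages(rewards)
```

In this sketch, two responses with the same correct answer receive different advantages when the listener is more convinced by one reasoning trace than the other, which is the dense signal that a purely binary correctness reward lacks.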
