Listener-Rewarded Thinking in VLMs for Image Preferences

Alexander Gambashidze; Li Pengyi; Matvey Skripkin; Andrey Galichin; Anton Gusarov; Konstantin Sobolev; Andrey Kuznetsov; Ivan Oseledets

arXiv:2506.22832 · cs.CV · April 13, 2026


TL;DR

This paper introduces a listener-augmented reinforcement learning framework for vision-language models that improves alignment with human preferences by re-evaluating reasoning traces, leading to better accuracy and out-of-distribution performance.

Contribution

The paper proposes a novel listener-augmented GRPO method that enhances reward calibration and reasoning consistency in vision-language models.

Findings

01

Achieves 67.4% accuracy on the ImageReward benchmark.

02

Significantly improves out-of-distribution performance (+6%).

03

Reduces reasoning contradictions compared to baselines.

Abstract

Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only…
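The reward shaping described in the abstract can be illustrated with a small sketch. The abstract is truncated here and does not give the exact reward formula, so the additive combination, the `weight` parameter, and the function names below are assumptions made for illustration only. The group-relative normalization follows the standard GRPO recipe of standardizing rewards within a group of sampled responses.

```python
from statistics import mean, pstdev


def shaped_reward(is_correct: bool, listener_conf: float, weight: float = 0.5) -> float:
    """Combine a binary correctness reward with a listener confidence score.

    listener_conf is assumed to be the probability (in [0, 1]) that the frozen
    listener assigns to the correct image after reading the reasoner's
    chain-of-thought. The additive form and `weight` are hypothetical; the
    source does not specify how the two terms are combined.
    """
    if not 0.0 <= listener_conf <= 1.0:
        raise ValueError("listener_conf must lie in [0, 1]")
    return float(is_correct) + weight * listener_conf


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of sampled responses (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled reasoning traces for one prompt: (answer correct?, listener confidence).
samples = [(True, 0.8), (False, 0.2), (True, 0.0), (False, 0.0)]
rewards = [shaped_reward(ok, conf) for ok, conf in samples]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, a trace that is correct and also convinces the listener earns a higher advantage than one that is correct but unconvincing, which is the sense in which the listener's score makes the reward signal denser than a binary one.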


Code & Models

Repositories

https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner

Models

alexgambashidze/qwen2.5vl_image_preference_reasoner
