Self-Rewarding Vision-Language Model via Reasoning Decomposition

Zongxia Li; Wenhao Yu; Chengsong Huang; Zhenwen Liang; Rui Liu; Fuxiao Liu; Jingxi Che; Dian Yu; Jordan Boyd-Graber; Haitao Mi; Dong Yu

arXiv:2508.19652·cs.CV·April 28, 2026

TL;DR

Vision SR1 is a three-stage self-rewarding reinforcement learning approach that enhances visual reasoning in vision-language models without external supervision or extra GPU costs.

Contribution

It introduces a novel decomposition of reasoning into visual and language components with a self-contained reward mechanism, improving visual reasoning and reducing hallucinations.

Findings

1. Improves visual reasoning across diverse tasks.

2. Reduces hallucinations and reliance on language shortcuts.

3. More efficient than external reward-based methods.

Abstract

Vision-Language Models (VLMs) often suffer from two problems: visual hallucinations, where they generate content that is inconsistent with the visual input, and language shortcuts, where they skip the visual part and rely only on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. We introduce Vision SR1, a three-stage self-rewarding reinforcement learning method that improves visual reasoning without relying on external visual supervision. Vision SR1 decomposes VLM reasoning into two components, visual reasoning and language reasoning, where the model is first prompted to produce self-contained visual descriptions…
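The decomposition described above can be illustrated with a small sketch. The abstract is truncated before it explains how the visual description is rewarded, so everything specific here is an assumption rather than the paper's stated method: the tag names (`<description>`, `<think>`, `<answer>`), the `answer_from_text` callable (a hypothetical stand-in for a model answering the question from the description alone, without the image), and the `visual_weight` parameter are all invented for illustration. The sketch contrasts answer-only matching, which supervises just the final output, with one plausible way to also score the visual component.

```python
import re
from typing import Callable, NamedTuple, Optional


class Decomposed(NamedTuple):
    visual: str    # visual reasoning: a description of what the image shows
    language: str  # language reasoning: text-only reasoning over that description
    answer: str    # final answer


# Hypothetical output format; the source does not specify tag names.
_PATTERN = re.compile(
    r"<description>(.*?)</description>\s*"
    r"<think>(.*?)</think>\s*"
    r"<answer>(.*?)</answer>",
    re.DOTALL,
)


def decompose(response: str) -> Optional[Decomposed]:
    """Split a response into its three parts, or return None if malformed."""
    match = _PATTERN.search(response)
    if match is None:
        return None
    return Decomposed(*(part.strip() for part in match.groups()))


def answer_reward(predicted: str, gold: str) -> float:
    """Verifiable answer matching: looks only at the final output."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def visual_reward(
    description: str,
    question: str,
    gold: str,
    answer_from_text: Callable[[str, str], str],
) -> float:
    """Assumed self-reward: the description earns credit if the question can
    be answered correctly from the description alone (no image)."""
    return answer_reward(answer_from_text(description, question), gold)


def total_reward(
    response: str,
    question: str,
    gold: str,
    answer_from_text: Callable[[str, str], str],
    visual_weight: float = 0.5,  # hypothetical weighting, not from the source
) -> float:
    """Combine the answer reward with the assumed visual reward."""
    parts = decompose(response)
    if parts is None:
        return 0.0  # malformed responses earn nothing in this sketch
    return answer_reward(parts.answer, gold) + visual_weight * visual_reward(
        parts.visual, question, gold, answer_from_text
    )
```

Under these assumptions, a response that reaches the right answer with an uninformative description scores lower than one whose description is sufficient on its own, which is the kind of denser visual signal the abstract says answer-only supervision lacks.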

Code & Models

Repositories

zli12321/Vision-SR1 (GitHub)
