TL;DR
Vision SR1 is a three-stage self-rewarding reinforcement learning approach that enhances visual reasoning in vision-language models without external supervision or extra GPU costs.
Contribution
It introduces a novel decomposition of reasoning into visual and language components with a self-contained reward mechanism, improving visual reasoning and reducing hallucinations.
Findings
Improves visual reasoning across diverse tasks.
Reduces reliance on language shortcuts and hallucinations.
More efficient than external reward-based methods.
Abstract
Vision-Language Models (VLMs) often suffer from two failure modes: visual hallucinations, where they generate content that is inconsistent with the visual input, and language shortcuts, where they skip visual perception and rely on text priors alone. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. We introduce Vision SR1, a three-stage self-rewarding reinforcement learning method that improves visual reasoning without relying on external visual supervision. Vision SR1 decomposes VLM reasoning into two components, visual reasoning and language reasoning, where the model is first prompted to produce self-contained visual descriptions…
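The decomposition described in the abstract can be illustrated with a rough Python sketch. It is not the authors' implementation. The tag names (`<visual>`, `<reasoning>`, `<answer>`), the `text_only_answer` callable, and the `visual_weight` value are all hypothetical, and because the abstract is cut off here, the way the visual description is scored (asking the same model to answer from the description alone, without the image) is an assumption about what "self-rewarding" could mean in practice.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical tag names; the summary does not specify an output format.
_SECTION = "<{tag}>(.*?)</{tag}>"


def split_response(text: str) -> Dict[str, Optional[str]]:
    """Split a response into visual description, language reasoning, and answer."""
    parts: Dict[str, Optional[str]] = {}
    for tag in ("visual", "reasoning", "answer"):
        match = re.search(_SECTION.format(tag=tag), text, flags=re.DOTALL)
        parts[tag] = match.group(1).strip() if match else None
    return parts


def answer_match(predicted: Optional[str], gold: str) -> float:
    """Simple verifiable answer matching: 1.0 on a case-insensitive exact match."""
    if predicted is None:
        return 0.0
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def total_reward(
    response: str,
    question: str,
    gold: str,
    text_only_answer: Callable[[str, str], str],
    visual_weight: float = 0.5,  # hypothetical weight, not taken from the paper
) -> float:
    """Combine the final-answer reward with an assumed self-reward on the description."""
    parts = split_response(response)
    outcome = answer_match(parts["answer"], gold)
    visual = 0.0
    if parts["visual"]:
        # Assumption: the description counts as self-contained if the question
        # can be answered correctly from the description alone, with no image.
        visual = answer_match(text_only_answer(parts["visual"], question), gold)
    return outcome + visual_weight * visual
```

In a real training loop `text_only_answer` would call the same VLM with the image withheld; here it is left as a parameter so the sketch stays self-contained. Scoring the description separately gives the visual part of the response its own signal instead of relying only on the final answer.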
