Learning Self-Correction in Vision-Language Models via Rollout Augmentation

Yi Ding; Ziliang Qiu; Bolian Li; Ruqi Zhang

arXiv:2602.08503 · cs.CV · February 10, 2026

Open Access · 1 model

TL;DR

This paper introduces Octopus, a reinforcement learning framework that improves self-correction in vision-language models by recombining existing rollouts into dense self-correction training examples, improving both benchmark performance and training efficiency.

Contribution

The paper presents Octopus, a novel RL rollout augmentation method with response-masking, enabling effective self-correction learning in large vision-language models.

Findings

01 Achieves state-of-the-art results on 7 benchmarks.
02 Outperforms the RLVR baseline by 1.0 points.
03 Requires only 0.72× the training time per step.

Abstract

Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency due to rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction…
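The recombination and masking ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `Rollout`, `CorrectionExample`, `augment`, `sep_tokens`, and `per_kind` are hypothetical, and the choices to pair every ordered pair of rollouts, to sample an equal number of examples per outcome pattern, and to mask the first response and the separator out of the loss are assumptions about one plausible reading of "recombining existing rollouts", "balanced supervision", and "response-masking".

```python
from __future__ import annotations

import random
from dataclasses import dataclass
from itertools import permutations


@dataclass
class Rollout:
    tokens: list[int]  # token ids of one sampled response
    correct: bool      # outcome from a verifier (assumed to be available)


@dataclass
class CorrectionExample:
    tokens: list[int]     # first response + separator + second response
    loss_mask: list[int]  # 1 where the token contributes to the loss
    kind: str             # e.g. "wrong->right"


def augment(rollouts: list[Rollout], sep_tokens: list[int],
            per_kind: int, seed: int = 0) -> list[CorrectionExample]:
    """Recombine existing rollouts into two-turn self-correction examples.

    Assumption: every ordered pair of distinct rollouts is a candidate, and
    up to `per_kind` examples are kept for each outcome pattern so that no
    single pattern dominates the batch.
    """
    rng = random.Random(seed)
    buckets: dict[str, list[CorrectionExample]] = {}
    for first, second in permutations(rollouts, 2):
        kind = "{}->{}".format("right" if first.correct else "wrong",
                               "right" if second.correct else "wrong")
        tokens = first.tokens + sep_tokens + second.tokens
        # Assumed masking: only the second (corrected) response is trained on.
        prefix_len = len(first.tokens) + len(sep_tokens)
        loss_mask = [0] * prefix_len + [1] * len(second.tokens)
        buckets.setdefault(kind, []).append(
            CorrectionExample(tokens, loss_mask, kind))

    examples: list[CorrectionExample] = []
    for kind in sorted(buckets):
        pool = buckets[kind]
        examples.extend(rng.sample(pool, min(per_kind, len(pool))))
    return examples
```

Because the synthetic examples are built from responses that were already sampled, no additional generation is needed to produce them, which is consistent with the sample-efficiency claim in the abstract. How the actual method selects pairs and applies the mask may differ from this sketch.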


Code & Models

Models

🤗 Tuwhy/Octopus-8B · model · 80 downloads


Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications