TL;DR
RISE is a reliable self-evolving framework for vision-language models (VLMs) that strengthens their reasoning by addressing key challenges in autonomous question generation and skill maintenance.
Contribution
The paper proposes a self-evolving approach that combines fine-grained role alternation, quality supervision, and dynamic balancing to improve VLMs without extensive human supervision.
Findings
Consistent performance improvements across seven benchmarks.
Enhanced question validity and pseudo-label reliability.
Broader and sustained skill coverage during evolution.
Abstract
Vision-language models (VLMs) have achieved strong multimodal reasoning capabilities, but further improving them still relies heavily on large-scale human-constructed supervision for post-training. Such supervision is costly to obtain, especially for reasoning-intensive multimodal tasks where questions, answers, and feedback signals must be carefully designed. This motivates self-evolving learning, where a model improves itself through a dual-role closed loop: a questioner autonomously poses questions and a solver learns to solve them. However, we observe that current VLM self-evolving methods still face three major challenges: coarse-grained role alternation delays the interaction between question generation and solver adaptation; generated questions can progressively degrade in quality; and question types may collapse toward a narrow distribution. These issues limit the efficiency and…
