V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation
Han Wang, Yi Yang, Jingyuan Hu, Minfeng Zhu, Wei Chen

TL;DR
V-Zero introduces a self-improving framework for multimodal reasoning that enhances vision-language models using only unlabeled images through a co-evolutionary loop of question generation and solving.
Contribution
V-Zero presents a self-supervised post-training method for multimodal models that removes the need for human annotations by iteratively improving both question generation and question answering.
Findings
Achieves performance gains on Qwen2.5-VL-7B-Instruct without human labels.
Improves visual mathematical reasoning by 1.7 points.
Improves general vision-centric tasks by 2.6 points.
Abstract
Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated datasets, which are costly and time-consuming to acquire. To overcome this limitation, we introduce V-Zero, a general post-training framework that facilitates self-improvement using exclusively unlabeled images. V-Zero establishes a co-evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high-quality, challenging questions by leveraging a dual-track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo-labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a…
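The Solver's pseudo-labeling step and the group-relative normalization used by GRPO can be illustrated with a small sketch. This is not the authors' implementation. The function names (`majority_vote`, `group_relative_advantages`), the binary match reward, the use of population standard deviation, and the `eps` constant are all assumptions made for illustration. The Questioner's dual-track reasoning reward is not sketched because the abstract does not give its form.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote(answers):
    """Return (pseudo_label, agreement) for a list of sampled answers.

    Hypothetical helper: the most frequent answer is taken as the
    pseudo-label, and agreement is the fraction of samples matching it.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / len(answers)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group, GRPO-style.

    Each reward is centered on the group mean and scaled by the group
    standard deviation (population std here, which is an assumption).
    eps avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical Solver responses to one generated question.
samples = ["12", "12", "15", "12"]
label, agreement = majority_vote(samples)

# Assumed binary reward: 1 if a response matches the pseudo-label, else 0.
rewards = [1.0 if a == label else 0.0 for a in samples]
advantages = group_relative_advantages(rewards)
```

In this sketch, responses that agree with the majority receive a positive advantage and the dissenting response receives a negative one. When every response agrees, all advantages are zero, so such a group contributes no learning signal.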
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
