V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation

Han Wang; Yi Yang; Jingyuan Hu; Minfeng Zhu; Wei Chen

arXiv:2601.10094·cs.CV·January 16, 2026

TL;DR

V-Zero is a self-improving framework for multimodal reasoning that improves vision-language models using only unlabeled images, through a co-evolutionary loop in which a Questioner generates questions and a Solver answers them.

Contribution

The paper presents a post-training method for vision-language models that needs no human annotations: a question-generating role and a question-answering role are trained iteratively, each improving alongside the other.

Findings

01. Achieves performance gains on Qwen2.5-VL-7B-Instruct without human labels.

02. Improves visual mathematical reasoning by +1.7 points.

03. Improves performance on general vision-centric tasks by +2.6 points.

Abstract

Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated datasets, which are costly and time-consuming to acquire. To overcome this limitation, we introduce V-Zero, a general post-training framework that facilitates self-improvement using exclusively unlabeled images. V-Zero establishes a co-evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high-quality, challenging questions by leveraging a dual-track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo-labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a…
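
As an illustration of the Solver side described in the abstract, the sketch below derives a pseudo-label by majority vote over sampled answers and then computes group-relative advantages in the style commonly used with GRPO (each reward standardized by its group's mean and standard deviation). The function names, the binary "matches the pseudo-label" reward, and the sample answers are assumptions made for this example; the paper's actual reward design, including the Questioner's dual-track reward, is not reproduced here.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote(answers):
    """Return (pseudo_label, agreement) for one question's sampled answers.

    Ties are resolved in favor of the answer that was sampled first.
    """
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical answers sampled from the Solver for a single generated question.
samples = ["12", "12", "15", "12"]

label, agreement = majority_vote(samples)

# Assumed reward: 1.0 if a sample agrees with the pseudo-label, else 0.0.
rewards = [1.0 if a == label else 0.0 for a in samples]
advantages = group_relative_advantages(rewards)

print(label, agreement)
print([round(a, 3) for a in advantages])
```

Samples that agree with the majority receive a positive advantage and the dissenting sample a negative one, so no human-provided answer is involved at any point. The agreement ratio is returned as well because it is a natural signal of how confident the vote is, though how V-Zero uses such a signal is not stated in the text above.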

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning