MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data
Zongxia Li, Hongyang Du, Chengsong Huang, Xiyang Wu, Lantao Yu, Yicheng He, Jing Xie, Xiaomin Wu, Zhichao Liu, Jiarui Zhang, Fuxiao Liu

TL;DR
MM-Zero introduces a novel RL-based framework enabling vision-language models to self-evolve from zero data through a multi-role system involving proposing, coding, and solving, significantly advancing multimodal reasoning capabilities.
Contribution
This work presents the first zero-data self-evolving framework for VLMs using a multi-role setup and Group Relative Policy Optimization, extending self-improvement beyond dual-role paradigms.
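As a rough illustration of the group-relative idea behind Group Relative Policy Optimization, the sketch below normalizes each sampled response's reward against the mean and standard deviation of its own sampling group. This shows only the generic advantage step, not the paper's training code. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration. The rewards MM-Zero actually assigns to each role are not described in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: each response sampled for the same prompt is
    scored relative to the group mean, scaled by the group spread.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division defined when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four responses sampled from one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group average receive positive advantages and those below receive negative ones, so no separate learned value model is needed to provide a baseline.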
Findings
Improves VLM reasoning performance across multiple benchmarks.
Demonstrates effective zero-data self-evolution in multimodal models.
Establishes a scalable multi-model self-evolving training framework.
Abstract
Self-evolving has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs introduce an additional visual modality that typically requires at least some seed data, such as images, to bootstrap the self-evolution process. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning. Moving beyond prior dual-role (Proposer and Solver) setups, MM-Zero introduces a multi-role self-evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g.,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
