Diving into Self-Evolving Training for Multimodal Reasoning
Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He

TL;DR
This paper introduces M-STAR, a novel framework for self-evolving multimodal reasoning that leverages reinforcement learning principles, identifies key training factors, and addresses performance saturation to improve model capabilities across benchmarks.
Contribution
It reframes self-evolving training for multimodal reasoning using RL concepts, analyzes critical factors, and proposes mechanisms to mitigate saturation, leading to consistent performance improvements.
Findings
Identifies three pivotal factors: Training Method, Reward Model, Prompt Variation.
Develops an automatic balancing mechanism to address saturation.
Achieves significant performance gains across models and benchmarks.
Abstract
Self-evolving training, where models iteratively learn from their own outputs, has emerged as a key approach for complex reasoning tasks, addressing the scarcity of high-quality chain-of-thought data. However, its effectiveness in multimodal reasoning, a domain more intricate than text-only reasoning, remains underexplored, and the understanding of critical factors in this training paradigm remains limited. Furthermore, a central challenge for this training method is performance saturation, which impedes further improvements and scalability. Inspired by reinforcement learning (RL), in this paper, we reframe self-evolving training for multimodal reasoning through the lens of RL, identifying three pivotal factors: Training Method, Reward Model, and Prompt Variation. Through systematic analysis, we establish relatively optimal design principles that significantly enhance multimodal…
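To make the idea of a model learning iteratively from its own outputs concrete, here is a minimal toy sketch of a generic self-evolving loop: sample responses, score them with a reward function, keep the ones that pass a threshold, and update the policy on what was kept. This is not the M-STAR implementation. The function names (self_evolving_training, sample_fn, reward_fn, update_fn), the scalar "policy", the threshold, and the arithmetic toy task are all assumptions made for illustration only.

```python
import random

def self_evolving_training(policy, prompts, sample_fn, reward_fn, update_fn,
                           iterations=3, samples_per_prompt=4, threshold=0.5):
    """Generic sample -> score -> filter -> update loop (illustrative only)."""
    history = []  # number of kept samples per iteration
    for _ in range(iterations):
        kept = []
        for prompt in prompts:
            for _ in range(samples_per_prompt):
                response = sample_fn(policy, prompt)
                # Keep only responses the reward function rates highly enough.
                if reward_fn(prompt, response) >= threshold:
                    kept.append((prompt, response))
        # Train the policy on its own filtered outputs.
        policy = update_fn(policy, kept)
        history.append(len(kept))
    return policy, history

# --- Toy stand-ins (hypothetical): the "policy" is just an accuracy in [0, 1].
rng = random.Random(0)
prompts = [(a, b) for a in range(3) for b in range(3)]  # "compute a + b"

def sample_fn(policy, prompt):
    a, b = prompt
    # Answer correctly with probability equal to the current policy value.
    return a + b if rng.random() < policy else a + b + 1

def reward_fn(prompt, response):
    # Rule-based reward: 1.0 for a correct sum, 0.0 otherwise.
    return 1.0 if response == sum(prompt) else 0.0

def update_fn(policy, kept):
    # Nudge accuracy upward in proportion to the kept samples, capped at 1.0.
    return min(1.0, policy + 0.01 * len(kept))

final_policy, history = self_evolving_training(
    0.3, prompts, sample_fn, reward_fn, update_fn
)
print(final_policy, history)
```

In this toy setting the three factors named in the abstract map loosely onto the three callables: update_fn stands in for the training method, reward_fn for the reward model, and the prompts list for prompt variation. A real system would replace each with a model-based component.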
Taxonomy
Topics: Speech and dialogue systems · Language, Metaphor, and Cognition · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
