Diving into Self-Evolving Training for Multimodal Reasoning

Wei Liu; Junlong Li; Xiwen Zhang; Fan Zhou; Yu Cheng; Junxian He

arXiv:2412.17451·cs.CL·June 9, 2025

TL;DR

This paper introduces M-STAR, a novel framework for self-evolving multimodal reasoning that leverages reinforcement learning principles, identifies key training factors, and addresses performance saturation to improve model capabilities across benchmarks.

Contribution

It reframes self-evolving training for multimodal reasoning using RL concepts, analyzes critical factors, and proposes mechanisms to mitigate saturation, leading to consistent performance improvements.

Findings

01. Identifies three pivotal factors: Training Method, Reward Model, Prompt Variation.

02. Develops an automatic balancing mechanism to address saturation.

03. Achieves significant performance gains across models and benchmarks.

Abstract

Self-evolving training--where models iteratively learn from their own outputs--has emerged as a key approach for complex reasoning tasks, addressing the scarcity of high-quality chain-of-thought data. However, its effectiveness in multimodal reasoning, a domain more intricate than text-only reasoning, remains underexplored, and the understanding of critical factors in this training paradigm remains limited. Furthermore, a central challenge for this training method is performance saturation, which impedes further improvements and scalability. In this paper, we reframe self-evolving training for multimodal reasoning through the lens of reinforcement learning (RL), identifying three pivotal factors: Training Method, Reward Model, and Prompt Variation. Through systematic analysis, we establish relatively optimal design principles that significantly enhance multimodal…
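As a rough illustration of what "iteratively learn from their own outputs" can look like, here is a minimal toy sketch of a generic sample, score, filter, and train loop. It is not the M-STAR implementation. Every name in it (self_evolve, ToyPolicy, exact_match_reward, the threshold and step size) is hypothetical and chosen only for illustration. The three arguments loosely mirror the factors named in the abstract: train stands in for the Training Method, reward for the Reward Model, and prompts for Prompt Variation.

```python
import random
from typing import Callable, List, Tuple


def self_evolve(
    prompts: List[str],
    sample: Callable[[str, random.Random], str],
    reward: Callable[[str, str], float],
    train: Callable[[List[Tuple[str, str]]], None],
    iterations: int = 3,
    samples_per_prompt: int = 4,
    threshold: float = 0.5,
    seed: int = 0,
) -> List[int]:
    """Toy loop: sample responses, keep those the reward accepts, train on them.

    Returns the number of kept (prompt, response) pairs per iteration.
    """
    rng = random.Random(seed)
    kept_per_iteration = []
    for _ in range(iterations):
        batch = []
        for prompt in prompts:
            for _ in range(samples_per_prompt):
                response = sample(prompt, rng)
                # Filter the model's own outputs by reward before training.
                if reward(prompt, response) >= threshold:
                    batch.append((prompt, response))
        train(batch)
        kept_per_iteration.append(len(batch))
    return kept_per_iteration


class ToyPolicy:
    """Hypothetical stand-in for a model: answers 'a+b' prompts,
    correctly with probability p_correct."""

    def __init__(self, p_correct: float = 0.3):
        self.p_correct = p_correct

    def sample(self, prompt: str, rng: random.Random) -> str:
        a, b = map(int, prompt.split("+"))
        answer = a + b
        if rng.random() >= self.p_correct:
            answer += 1  # deliberately wrong answer
        return str(answer)

    def train(self, batch: List[Tuple[str, str]]) -> None:
        # Assumed update rule: each kept sample nudges accuracy up, capped at 1.0.
        self.p_correct = min(1.0, self.p_correct + 0.02 * len(batch))


def exact_match_reward(prompt: str, response: str) -> float:
    """Hypothetical rule-based reward: 1.0 for the correct sum, else 0.0."""
    a, b = map(int, prompt.split("+"))
    return 1.0 if response == str(a + b) else 0.0


if __name__ == "__main__":
    policy = ToyPolicy(p_correct=0.3)
    kept = self_evolve(["1+2", "3+4"], policy.sample, exact_match_reward, policy.train)
    print(kept, policy.p_correct)
```

In this toy, the policy's accuracy can only rise as more of its own correct outputs pass the filter. A real system would replace ToyPolicy.train with a fine-tuning step and exact_match_reward with a learned or rule-based reward model.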

Taxonomy

Topics: Speech and dialogue systems · Language, Metaphor, and Cognition · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training