First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training
Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun

TL;DR
This paper introduces MM-UPT, an unsupervised post-training framework that enables continual self-improvement of multi-modal large language models without external labels, significantly enhancing reasoning abilities through a simple, scalable approach.
Contribution
The paper presents MM-UPT, an unsupervised post-training method that improves MLLMs' reasoning skills without requiring annotated data, positioned as a third post-training stage after supervised fine-tuning and reinforcement learning.
Findings
Improved reasoning accuracy on MathVista and We-Math datasets.
Effective self-generated data strategies enhance model performance.
Demonstrated scalability of unsupervised self-improvement methods.
Abstract
Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL), which require expensive and manually annotated multi-modal data--an ultimately unsustainable resource. This limitation has motivated a growing interest in unsupervised paradigms as a third stage of post-training after SFT and RL. While recent efforts have explored this direction, their methods are complex and difficult to iterate. To address this, we propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs, enabling continual self-improvement without any external supervision. The training method of MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that such training…
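The self-rewarding step described in the abstract can be illustrated with a minimal sketch. It assumes each sampled response has already been reduced to a final answer string, gives a reward of 1 to responses that match the majority answer and 0 otherwise, and then applies GRPO-style group normalization. The function names, the binary reward values, the use of the population standard deviation, and the `eps` term are illustrative assumptions, not details confirmed by the abstract.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote_rewards(answers):
    """Assign 1.0 to answers matching the most common answer, else 0.0.

    `answers` holds the final answers extracted from several responses
    sampled for the same question (hypothetical preprocessing step).
    """
    counts = Counter(answers)
    majority_answer, _ = counts.most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group, GRPO-style.

    Each advantage is (reward - group mean) / (group std + eps).
    The population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question; no ground-truth label is used.
sampled = ["12", "12", "15", "12"]
rewards = majority_vote_rewards(sampled)
advantages = group_advantages(rewards)
```

In this sketch, a group whose responses all agree yields zero advantages for every response, so such a question contributes no learning signal to that update.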
Code & Models
- WaltonFuture/geometry3k-in-context-synthesizing (dataset, 24 downloads)
- WaltonFuture/geometry3k-direct-synthesizing (dataset, 14 downloads)
- WaltonFuture/GeoQA-8K-in-context-synthesizing (dataset, 10 downloads)
- WaltonFuture/GeoQA-8K-direct-synthesizing (dataset, 9 downloads)
- WaltonFuture/MMR1-in-context-synthesizing (dataset, 28 downloads)
