First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training

Lai Wei; Yuting Li; Chen Wang; Yue Wang; Linghe Kong; Weiran Huang; Lichao Sun

arXiv:2505.22453 · cs.CL · October 28, 2025

TL;DR

This paper introduces MM-UPT, an unsupervised post-training framework that lets multi-modal large language models (MLLMs) keep improving their reasoning without external labels. It builds on GRPO and replaces the usual reward signal with a self-reward derived from majority voting over multiple sampled responses.

Contribution

The paper presents MM-UPT, an unsupervised post-training method that improves the reasoning skills of MLLMs without annotated data. It is positioned as a third post-training stage after supervised fine-tuning (SFT) and reinforcement learning (RL), both of which depend on manually annotated multi-modal data.

Findings

01. Improved reasoning accuracy on MathVista and We-Math datasets.

02. Effective self-generated data strategies enhance model performance.

03. Demonstrated scalability of unsupervised self-improvement methods.

Abstract

Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL), both of which require expensive, manually annotated multi-modal data, an ultimately unsustainable resource. This limitation has motivated growing interest in unsupervised paradigms as a third stage of post-training after SFT and RL. While recent efforts have explored this direction, their methods are complex and difficult to iterate on. To address this, we propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs that enables continual self-improvement without any external supervision. The training method of MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that such training…
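The self-rewarding mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each sampled response has already been reduced to a final answer string, that the reward is binary (1.0 for agreeing with the majority answer, 0.0 otherwise), and that advantages are computed GRPO-style by normalizing rewards within the group of samples. The function names `majority_vote_rewards` and `group_relative_advantages` and the `eps` parameter are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote_rewards(answers):
    """Score sampled answers against their own majority vote.

    Assumption: a binary reward, 1.0 if an answer equals the most common
    answer in the group and 0.0 otherwise. No ground-truth label is used.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # The most frequent answer serves as the pseudo-label for this question.
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group (mean 0, roughly unit scale).

    This mirrors the group-relative normalization used by GRPO; `eps`
    is a hypothetical guard against division by zero when all rewards
    in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical final answers sampled for one question.
    sampled = ["12", "12", "15", "12"]
    label, rewards = majority_vote_rewards(sampled)
    advantages = group_relative_advantages(rewards)
    print(label, rewards, [round(a, 3) for a in advantages])
```

When every sample agrees, all rewards are identical and the advantages are all zero, so under this sketch such a question contributes no learning signal.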

Code & Models

Repositories

Models

🤗
WaltonFuture/Qwen2.5-VL-7B-MM-UPT-MMR1
model· 352 dl· ♡ 3

Datasets

Videos

First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training· slideslive