Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

Sudong Wang; Weiquan Huang; Xiaomin Yu; Zuhao Yang; Hehai Lin; Keming Wu; Chaojun Xiao; Chen Chen; Wenxuan Wang; Beier Zhu; Yunjian Zhang; Chengwei Qin

arXiv:2604.28123·cs.CV·May 5, 2026

TL;DR

PRISM is a three-stage pipeline that inserts an explicit distribution-alignment stage between supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR), mitigating the distributional drift that SFT introduces in large multimodal models.

Contribution

The paper proposes a black-box, response-level on-policy distillation method in which the policy plays an adversarial game against a Mixture-of-Experts (MoE) discriminator, aligning distributions before multimodal RL.

Findings

01  PRISM improves downstream RLVR performance on multimodal benchmarks.

02  It yields accuracy gains of +4.4 and +6.0 points over the baseline on the 4B and 8B models, respectively.

03  The method is effective across multiple RL algorithms and diverse tasks.

Abstract

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and…
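To make the idea of a response-level adversarial game more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. Everything in it is an assumption made for illustration: the feature vectors, the linear experts, the fixed gate, and the function names are hypothetical, and the second expert is assumed to target reasoning because the abstract is truncated at that point. The sketch only shows the general pattern: a discriminator built from two experts scores whole responses, is trained to tell demonstration responses from policy samples, and its score is reused as a reward for the policy. It is "black-box" in the sense that it needs only sampled responses, never the logits of the model that produced the demonstrations.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def expert_logit(weights, features):
    # Hypothetical linear expert: one logit per response-level feature vector.
    return sum(w * f for w, f in zip(weights, features))

def moe_discriminator(features, experts, gate):
    # Gate-weighted mix of expert logits, squashed to a probability that the
    # response came from the demonstration (supervision) distribution.
    # ASSUMPTION: a fixed gate; a real MoE would typically learn it.
    logit = sum(g * expert_logit(w, features) for g, w in zip(gate, experts))
    return sigmoid(logit)

def discriminator_loss(demo_batch, policy_batch, experts, gate):
    # Binary cross-entropy: demonstrations are labeled 1, policy samples 0.
    eps = 1e-12
    total = 0.0
    for x in demo_batch:
        total -= math.log(moe_discriminator(x, experts, gate) + eps)
    for x in policy_batch:
        total -= math.log(1.0 - moe_discriminator(x, experts, gate) + eps)
    return total / (len(demo_batch) + len(policy_batch))

def policy_rewards(policy_batch, experts, gate):
    # The policy is rewarded for responses the discriminator mistakes for
    # demonstrations; only sampled responses are needed (no teacher logits).
    return [moe_discriminator(x, experts, gate) for x in policy_batch]

# Toy data. Each response is a hypothetical 2-d feature vector:
# [perception feature, reasoning feature].
experts = [[1.0, 0.0],   # assumed perception expert
           [0.0, 1.0]]   # assumed reasoning expert
gate = [0.5, 0.5]
demos = [[2.0, 2.0]]
samples = [[-2.0, 2.0], [2.0, 2.0]]

rewards = policy_rewards(samples, experts, gate)
loss = discriminator_loss(demos, samples, experts, gate)
```

In this toy setup the first sample has a poor perception feature and receives a lower reward than the second, which matches the demonstration on both features. In an actual pipeline the two players would be updated in alternation; that training loop is omitted here.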

Code & Models

Repositories

XIAO4579/PRISM (GitHub)
