Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL
Sudong Wang, Weiquan Huang, Xiaomin Yu, Zuhao Yang, Hehai Lin, Keming Wu, Chaojun Xiao, Chen Chen, Wenxuan Wang, Beier Zhu, Yunjian Zhang, Chengwei Qin

TL;DR
PRISM is a three-stage pipeline that inserts an explicit distribution-alignment stage between supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR), mitigating the distributional drift that SFT introduces in large multimodal models.
Contribution
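The paper proposes a black-box, response-level on-policy distillation method in which the policy plays an adversarial game against a Mixture-of-Experts (MoE) discriminator, aligning the policy's output distribution with the supervision distribution before multimodal RL.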
Findings
PRISM improves downstream RLVR performance on multimodal benchmarks.
It achieves accuracy gains of +4.4 and +6.0 points over the baseline on the 4B and 8B models, respectively.
The method is effective across multiple RL algorithms and diverse tasks.
Abstract
The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and…
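The adversarial game described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract is cut off before it specifies the discriminator, so everything here is an assumption made for illustration, including the two scalar features per response (one "perception" feature, one "reasoning" feature), the fixed gate, the logistic experts, and the names `MoEDiscriminator` and `policy_reward`. The only property it tries to mirror is the black-box, response-level setup: the discriminator sees finished responses (here, hypothetical features of them), not the supervision source's logits, and its score is used as a reward signal for the policy.

```python
import math
from dataclasses import dataclass


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class MoEDiscriminator:
    """Toy two-expert discriminator over response-level features (hypothetical)."""
    w_perception: float = 0.0  # weight of the assumed perception expert
    w_reasoning: float = 0.0   # weight of the assumed reasoning expert
    gate: float = 0.5          # fixed mixing weight; a real gate would be learned

    def logit(self, feats: tuple[float, float]) -> float:
        p_feat, r_feat = feats
        return (self.gate * self.w_perception * p_feat
                + (1.0 - self.gate) * self.w_reasoning * r_feat)

    def prob_reference(self, feats: tuple[float, float]) -> float:
        """Estimated probability that a response came from the supervision data."""
        return sigmoid(self.logit(feats))

    def update(self, reference, policy, lr: float = 0.5) -> None:
        """One pass of per-sample gradient steps on binary cross-entropy.

        Reference responses are labeled 1, on-policy responses are labeled 0.
        """
        labeled = [(f, 1.0) for f in reference] + [(f, 0.0) for f in policy]
        for feats, label in labeled:
            err = label - self.prob_reference(feats)  # negative BCE gradient w.r.t. logit
            self.w_perception += lr * err * self.gate * feats[0]
            self.w_reasoning += lr * err * (1.0 - self.gate) * feats[1]


def policy_reward(disc: MoEDiscriminator, feats: tuple[float, float]) -> float:
    """Response-level reward: higher when the discriminator is fooled."""
    return disc.prob_reference(feats)


if __name__ == "__main__":
    # Hypothetical features: reference responses score high, policy samples low.
    reference = [(1.0, 1.0), (0.8, 0.9)]
    policy = [(-1.0, -0.5), (-0.7, -1.0)]
    disc = MoEDiscriminator()
    for _ in range(20):
        disc.update(reference, policy)
    print(policy_reward(disc, reference[0]), policy_reward(disc, policy[0]))
```

In a full pipeline the policy would then be updated with an RL algorithm to raise this reward on its own samples, and the two updates would alternate. That policy step is omitted here because the page gives no details about it.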
Code & Models
- 🤗 prism-vlm/Qwen3-VL-4B-Instruct-SFT-PRISM-GRPO (model, 33 downloads)
- 🤗 prism-vlm/Qwen3-VL-4B-Instruct-SFT-PRISM-GSPO (model, 49 downloads)
- 🤗 prism-vlm/Qwen3-VL-4B-Instruct-SFT-PRISM-DAPO (model, 42 downloads)
- 🤗 prism-vlm/Qwen3-VL-8B-Instruct-SFT-PRISM-GSPO (model, 41 downloads)
- 🤗 prism-vlm/Qwen3-VL-8B-Instruct-SFT-PRISM-DAPO (model, 40 downloads)
- 🤗 prism-vlm/Qwen3-VL-8B-Instruct-SFT-PRISM-GRPO (model, 47 downloads)
