S-GRPO: Unified Post-Training for Large Vision-Language Models
Yuming Yan, Kai Tang, Sihong Chen, Ke Xu, Dan Hu, Qun Yu, Pengfei Hu

TL;DR
This paper introduces S-GRPO, a unified post-training method for large vision-language models that combines supervised fine-tuning and reinforcement learning to improve domain adaptation and efficiency.
Contribution
S-GRPO integrates imitation learning with preference optimization, introducing CGI to enhance exploration and accelerate convergence in visual tasks.
Findings
S-GRPO outperforms traditional methods in domain adaptation.
It converges faster than SFT or RL applied in isolation.
It preserves general multimodal capabilities while adapting to new domains.
Abstract
Current post-training methodologies for adapting Large Vision-Language Models (LVLMs) generally fall into two paradigms: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Despite their prevalence, both approaches suffer from inefficiencies when applied in isolation. SFT forces the model's generation along a single expert trajectory, often inducing catastrophic forgetting of general multimodal capabilities due to distributional shifts. Conversely, RL explores multiple generated trajectories but frequently encounters optimization collapse: a cold-start problem in which an unaligned model fails to spontaneously sample any domain-valid trajectories in sparse-reward visual tasks. In this paper, we propose Supervised Group Relative Policy Optimization (S-GRPO), a unified post-training framework that integrates the guidance of imitation learning into the multi-trajectory exploration…
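The cold-start problem mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used GRPO-style group-normalized advantage (each reward minus the group mean, divided by the group standard deviation). The function name, the reward values, and the idea of adding a single expert trajectory with reward 1.0 to the group are illustrative assumptions only. The abstract is truncated before it describes how S-GRPO actually injects imitation guidance, so this is not the paper's implementation.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style assumption)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps avoids division by zero when every reward in the group is identical
    return [(r - mean) / (std + eps) for r in rewards]


# Cold start: no sampled trajectory is domain-valid, so every reward is 0.
cold_start = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Hypothetical: one expert trajectory with reward 1.0 is added to the group.
with_expert = group_relative_advantages([0.0, 0.0, 0.0, 0.0, 1.0])

print(cold_start)   # all advantages are zero, so there is no learning signal
print(with_expert)  # the expert trajectory gets a positive advantage
```

When every reward in the group is identical, all advantages vanish and the policy gradient carries no signal. In this sketch, a single rewarded trajectory in the group is enough to restore a non-zero signal.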
