S-GRPO: Unified Post-Training for Large Vision-Language Models

Yuming Yan; Kai Tang; Sihong Chen; Ke Xu; Dan Hu; Qun Yu; Pengfei Hu

arXiv:2604.16557·cs.LG·April 21, 2026

TL;DR

This paper introduces S-GRPO, a unified post-training method for large vision-language models that combines supervised fine-tuning and reinforcement learning to improve domain adaptation and efficiency.

Contribution

S-GRPO integrates imitation learning with preference optimization, introducing CGI to enhance exploration and accelerate convergence in visual tasks.

Findings

01. S-GRPO outperforms traditional methods in domain adaptation.

02. It accelerates convergence compared to SFT and RL.

03. It preserves general multimodal capabilities while adapting to new domains.

Abstract

Current post-training methodologies for adapting Large Vision-Language Models (LVLMs) generally fall into two paradigms: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Despite their prevalence, both approaches suffer from inefficiencies when applied in isolation. SFT forces the model's generation along a single expert trajectory, often inducing catastrophic forgetting of general multimodal capabilities due to distributional shifts. Conversely, RL explores multiple generated trajectories but frequently encounters optimization collapse: a cold-start problem where an unaligned model fails to spontaneously sample any domain-valid trajectories in sparse-reward visual tasks. In this paper, we propose Supervised Group Relative Policy Optimization (S-GRPO), a unified post-training framework that integrates the guidance of imitation learning into the multi-trajectory exploration…
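
The cold-start problem described in the abstract can be illustrated with a small sketch of group-relative advantages, the quantity that GRPO-style methods compute by standardizing rewards within one sampled group. This is not the paper's implementation: the function name `group_relative_advantages`, the use of the population standard deviation, the `eps` constant, and the idea of appending one expert trajectory with reward 1.0 are all assumptions made for illustration only.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style).

    Assumption: population standard deviation plus a small eps for stability;
    real implementations may differ in both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Cold start: with a sparse reward, no sampled trajectory is domain-valid,
# so every reward is zero and every advantage is zero (no learning signal).
cold = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Hypothetical illustration: the same group plus one expert trajectory that
# earns reward 1.0. The group now has non-zero variance, so the expert gets
# a positive advantage and the failed samples get negative ones.
guided = group_relative_advantages([0.0, 0.0, 0.0, 0.0, 1.0])

print(cold)
print(guided)
```

In the first group the advantages carry no information about which trajectory to prefer, which is one way to read the "optimization collapse" the abstract mentions. The second group only shows why some form of expert guidance could restore a gradient signal; how S-GRPO actually injects that guidance is not specified in the truncated abstract above.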
