Consensus Group Relative Policy Optimization for Text Generation
Yuki Ichihara, Yuu Jinnai, Kaito Ariu, Eiji Uchibe

TL;DR
C-GRPO is a novel training method that distills the benefits of MBR decoding into a policy optimization framework, enabling efficient text generation without high inference costs or reliance on gold references.
Contribution
The paper introduces a group-relative policy optimization approach that aligns training with MBR decoding, reducing inference cost and removing the need for curated preference data.
Findings
Achieves MBR-level performance without inference overhead
Outperforms reference-free baselines in translation and summarization
Converges under ideal conditions to the expected-utility objective
Abstract
Many strong decoding methods for text generation follow a sample-and-rerank paradigm: they draw multiple candidates, score each under a utility (reward) function using consensus across samples, and return the best one. Although effective, these methods incur high computational costs during inference due to repeated sampling and scoring. Prior attempts to amortize inference-time computation typically rely on gold references, teacher labels, or curated preference data, increasing dataset construction effort and the demand for high-fidelity reward models. We propose Consensus Group Relative Policy Optimization (C-GRPO), which distills Minimum Bayes Risk (MBR) decoding into training by formulating the consensus utility as a group-relative objective within GRPO. C-GRPO requires only a utility function and policy samples, without gold references or explicit preference labels. Under ideal…
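The sample-and-rerank step and the group-relative framing described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Jaccard token-overlap utility, the choice to exclude each sample from its own consensus score, and the function names are all hypothetical, and the advantage normalization follows the commonly used GRPO form (reward minus group mean, divided by group standard deviation), which may differ in detail from the paper's objective.

```python
from statistics import mean, pstdev


def jaccard_utility(hyp: str, ref: str) -> float:
    """Toy utility (an assumption): token-set overlap between two strings."""
    a, b = set(hyp.split()), set(ref.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def consensus_scores(samples, utility=jaccard_utility):
    """Score each sample by its average utility against the other samples.

    The other samples act as pseudo-references, so no gold reference is needed.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to compute a consensus")
    return [
        mean([utility(samples[i], samples[j]) for j in range(n) if j != i])
        for i in range(n)
    ]


def mbr_select(samples, utility=jaccard_utility):
    """Inference-time MBR decoding: return the highest-consensus sample."""
    scores = consensus_scores(samples, utility)
    best = max(range(len(samples)), key=scores.__getitem__)
    return samples[best]


def group_relative_advantages(scores, eps=1e-8):
    """GRPO-style advantages: standardize scores within the sampled group."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


# Three hypothetical policy samples for one prompt.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog barked loudly",
]
scores = consensus_scores(samples)
advantages = group_relative_advantages(scores)
choice = mbr_select(samples)
```

In this sketch, `mbr_select` is what a sample-and-rerank decoder would run at inference time, whereas `group_relative_advantages` turns the same consensus scores into a per-sample training signal, so the cost of sampling and scoring is paid during training rather than at every generation.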
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
