Group Causal Policy Optimization for Post-Training Large Language Models

Ziyin Gu; Jingyao Wang; Ran Zuo; Chuxiong Sun; Zeen Song; Changwen Zheng; Wenwen Qiang

arXiv:2508.05428 · cs.LG · August 8, 2025

TL;DR

This paper introduces Group Causal Policy Optimization (GCPO), a novel method that incorporates causal structure into large language model fine-tuning, improving response quality by modeling semantic interactions among candidate responses.

Contribution

The paper proposes GCPO, which integrates causal modeling into policy optimization for LLMs. It addresses a limitation of earlier methods such as GRPO, which treat candidate responses as independent and ignore the interactions among them.

Findings

01 GCPO outperforms existing methods like GRPO on reasoning benchmarks.

02 Causal projection improves response prediction quality.

03 Incorporating causal structure enhances policy optimization effectiveness.

Abstract

Recent advances in large language models (LLMs) have broadened their applicability across diverse tasks, yet specialized domains still require targeted post-training. Among existing methods, Group Relative Policy Optimization (GRPO) stands out for its efficiency, leveraging groupwise relative rewards while avoiding costly value function learning. However, GRPO treats candidate responses as independent, overlooking semantic interactions such as complementarity and contradiction. To address this challenge, we first introduce a Structural Causal Model (SCM) that reveals hidden dependencies among candidate responses induced by conditioning on a final integrated output, which forms a collider structure. Then, our causal analysis leads to two insights: (1) projecting responses onto a causally informed subspace improves prediction quality, and (2) this projection yields a better baseline than query…
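As a rough illustration of the "groupwise relative rewards" that the abstract attributes to GRPO, the sketch below standardizes the rewards of several candidate responses to one query against the group's own mean and standard deviation, so no learned value function is needed as a baseline. It shows only the commonly described GRPO baseline, not GCPO's causal projection. The function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are all assumptions made for illustration and do not come from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same query.

    Hypothetical helper: the group mean acts as the baseline, so no separate
    value network is trained. Using the population standard deviation and an
    eps stabilizer is an assumption; implementations differ on both.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative rewards for four candidate responses to a single query.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Each response is scored only relative to its group's statistics, which is the independence assumption the abstract says GCPO aims to relax.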
