TL;DR
This paper introduces Group Causal Policy Optimization (GCPO), a novel method that incorporates causal structure into large language model fine-tuning, improving response quality by modeling semantic interactions among candidate responses.
Contribution
The paper proposes GCPO, integrating causal modeling into policy optimization for LLMs, addressing limitations of previous methods that ignore response interactions.
Findings
GCPO outperforms existing methods like GRPO on reasoning benchmarks.
Causal projection improves response prediction quality.
Incorporating causal structure enhances policy optimization effectiveness.
Abstract
Recent advances in large language models (LLMs) have broadened their applicability across diverse tasks, yet specialized domains still require targeted post-training. Among existing methods, Group Relative Policy Optimization (GRPO) stands out for its efficiency, leveraging groupwise relative rewards while avoiding costly value function learning. However, GRPO treats candidate responses as independent, overlooking semantic interactions such as complementarity and contradiction. To address this challenge, we first introduce a Structural Causal Model (SCM) that reveals hidden dependencies among candidate responses induced by conditioning on a final integrated output, which forms a collider structure. Then, our causal analysis leads to two insights: (1) projecting responses onto a causally informed subspace improves prediction quality, and (2) this projection yields a better baseline than query…
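The collider argument in the abstract can be illustrated with a toy numeric sketch. This is not the paper's SCM. It assumes two independent Gaussian scores standing in for two candidate responses and a hypothetical "integrated output" equal to their sum. The function names `pearson` and `collider_demo` and the parameters `n`, `window`, and `seed` are illustrative choices, not anything taken from the paper. The sketch only shows the general statistical effect: variables that are independent marginally become dependent once their common effect is conditioned on.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def collider_demo(n=20000, window=0.5, seed=0):
    """Return (marginal, conditional) correlation of two independent scores.

    Assumption for illustration only: each candidate response is reduced
    to one Gaussian score, and the integrated output is their sum.
    """
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Hypothetical collider: a common effect of both scores.
    z = [x + y for x, y in zip(a, b)]

    # Without conditioning, the two scores are (nearly) uncorrelated.
    marginal = pearson(a, b)

    # Conditioning on the collider: keep only samples whose sum is near 0.
    kept = [(x, y) for x, y, s in zip(a, b, z) if abs(s) < window]
    conditional = pearson([x for x, _ in kept], [y for _, y in kept])
    return marginal, conditional


if __name__ == "__main__":
    marginal, conditional = collider_demo()
    print(f"marginal correlation:    {marginal:+.3f}")
    print(f"conditional correlation: {conditional:+.3f}")
```

With these settings the marginal correlation should sit close to zero while the conditional one is strongly negative, which is the kind of induced dependence among candidates that the abstract says an independence assumption overlooks.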