Counterfactual Credit Policy Optimization for Multi-Agent Collaboration
Zhongyi Li, Wan Tian, Yikun Ban, Jinju Chen, Huiming Zhang, Yang Liu, Fuzhen Zhuang

TL;DR
This paper introduces CCPO, a multi-agent reinforcement learning method that improves credit assignment by estimating each agent's contribution through counterfactual trajectories, strengthening collaboration among large language model (LLM) agents.
Contribution
CCPO is a new framework that assigns agent-specific learning signals using counterfactual baselines, improving credit assignment and collaboration in multi-agent LLM systems.
Findings
CCPO outperforms existing multi-agent RL baselines on reasoning benchmarks.
It reduces free-riding and stabilizes policy optimization.
The method is effective across different collaboration topologies.
Abstract
Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles and aggregating diverse hypotheses. Yet, reinforcement learning (RL) for such systems is often undermined by credit assignment: a shared global reward obscures individual contributions, inflating update variance and encouraging free-riding. We introduce Counterfactual Credit Policy Optimization (CCPO), a framework that assigns agent-specific learning signals by estimating each agent's marginal contribution through counterfactual trajectories. CCPO builds dynamic counterfactual baselines that simulate outcomes with an agent's contribution removed, yielding role-sensitive advantages for policy optimization. To further improve stability under heterogeneous tasks and data distributions, we propose a global-history-aware normalization scheme that calibrates advantages using global…
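The core idea of a counterfactual baseline can be illustrated with a minimal sketch. It assumes that an agent's contribution can be removed from a collaboration and the result re-scored by a hypothetical `evaluate` callback; each agent's advantage is then the joint reward minus the reward with that agent removed. This is not the authors' implementation: the dynamic baselines, trajectory simulation, and global-history-aware normalization described in the paper are not reproduced here, and `counterfactual_advantages` and `toy_evaluate` are illustrative names.

```python
from typing import Callable, Dict, FrozenSet, Sequence


def counterfactual_advantages(
    agents: Sequence[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Assign each agent its marginal contribution to the joint reward.

    `evaluate` is a hypothetical callback that scores the collaboration
    when only the given set of agents contributes.
    """
    full = frozenset(agents)
    joint_reward = evaluate(full)
    # Credit for agent a = reward with everyone minus reward with a removed.
    return {a: joint_reward - evaluate(full - {a}) for a in agents}


def toy_evaluate(active: FrozenSet[str]) -> float:
    # Hypothetical task: only the solver's output determines correctness.
    return 1.0 if "solver" in active else 0.0


print(counterfactual_advantages(["solver", "verifier"], toy_evaluate))
# {'solver': 1.0, 'verifier': 0.0}
```

Under a shared global reward, both agents in this toy example would receive the same signal of 1.0. The counterfactual baseline instead gives the verifier, whose removal changes nothing, zero credit, which is the free-riding behavior that agent-specific advantages are meant to discourage.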
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Explainable Artificial Intelligence (XAI)
