Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

Zhongyi Li; Wan Tian; Yikun Ban; Jinju Chen; Huiming Zhang; Yang Liu; Fuzhen Zhuang

arXiv:2603.21563 · cs.AI · March 24, 2026

TL;DR

This paper introduces CCPO, a multi-agent reinforcement learning method that improves credit assignment among collaborating large language models by estimating each agent's individual contribution from counterfactual trajectories.

Contribution

CCPO is a new framework that assigns agent-specific learning signals using counterfactual baselines, improving credit assignment and collaboration in multi-agent LLM systems.

Findings

01. CCPO outperforms existing multi-agent RL baselines on reasoning benchmarks.
02. CCPO reduces free-riding among agents and improves policy optimization.
03. CCPO remains effective across different collaboration topologies.

Abstract

Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles and aggregating diverse hypotheses. Yet, reinforcement learning (RL) for such systems is often undermined by credit assignment: a shared global reward obscures individual contributions, inflating update variance and encouraging free-riding. We introduce Counterfactual Credit Policy Optimization (CCPO), a framework that assigns agent-specific learning signals by estimating each agent's marginal contribution through counterfactual trajectories. CCPO builds dynamic counterfactual baselines that simulate outcomes with an agent's contribution removed, yielding role-sensitive advantages for policy optimization. To further improve stability under heterogeneous tasks and data distributions, we propose a global-history-aware normalization scheme that calibrates advantages using global…
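The abstract describes the counterfactual baseline only at a high level. As a minimal sketch, assuming a team reward that can be re-evaluated with one agent's output dropped, an agent's advantage can be read as the reward of the full rollout minus the reward of the same rollout without that agent. In a real multi-agent pipeline, removing an agent also changes what downstream agents see, so the counterfactual would require re-running the rollout; this sketch simply re-scores the remaining outputs. The names `counterfactual_advantages`, `RunningNormalizer`, and `toy_reward` are hypothetical, and the running mean/variance normalizer is a generic stand-in for the paper's global-history-aware normalization (whose details are truncated above), not the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

# Hypothetical team reward: maps the outputs of the participating agents to a scalar.
RewardFn = Callable[[Dict[str, str]], float]


def counterfactual_advantages(outputs: Dict[str, str], reward_fn: RewardFn) -> Dict[str, float]:
    """Marginal contribution of each agent: R(all agents) - R(all agents except i)."""
    full_reward = reward_fn(outputs)
    advantages = {}
    for agent in outputs:
        # Counterfactual: same rollout with this agent's contribution removed
        # (simplification: the remaining outputs are re-scored, not regenerated).
        ablated = {name: out for name, out in outputs.items() if name != agent}
        advantages[agent] = full_reward - reward_fn(ablated)
    return advantages


@dataclass
class RunningNormalizer:
    """Assumed stand-in for history-aware normalization: Welford running
    mean/variance over every advantage observed so far."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0
    eps: float = 1e-8

    def update(self, values: Iterable[float]) -> None:
        for x in values:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    def normalize(self, x: float) -> float:
        # Fall back to unit variance until at least two values have been seen.
        var = self.m2 / self.count if self.count > 1 else 1.0
        return (x - self.mean) / math.sqrt(var + self.eps)


def toy_reward(outputs: Dict[str, str]) -> float:
    # Team succeeds if any agent produced the correct answer "42".
    return 1.0 if "42" in outputs.values() else 0.0


if __name__ == "__main__":
    adv = counterfactual_advantages({"solver": "42", "critic": "looks fine"}, toy_reward)
    print(adv)  # {'solver': 1.0, 'critic': 0.0}
    norm = RunningNormalizer()
    norm.update(adv.values())
    print({name: round(norm.normalize(a), 3) for name, a in adv.items()})
```

In the toy run, the critic's output does not change the team reward, so its counterfactual advantage is zero even though the shared reward is 1.0. This is the free-riding signal that a single global reward would hide.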

Taxonomy

Topics: Reinforcement Learning in Robotics · Topic Modeling · Explainable Artificial Intelligence (XAI)