COSAC: Counterfactual Credit Assignment in Sequential Cooperative Teams
Shripad Deshmukh, Jayakumar Subramanian, Raghavendra Addanki, Nikos Vlassis

TL;DR
COSAC is a critic-free policy gradient method for sequential cooperative multi-agent systems. It targets more efficient and scalable credit assignment, with reported gains on sequential bandit and ARC tasks.
Contribution
The paper proposes a novel additive reward decomposition and a counterfactual advantage computation that together extend the aristocrat utility to sequential teams, with theoretical bias-variance guarantees.
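To make the idea of an additive decomposition with a counterfactual advantage more concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes discrete actions and one-hot action features per agent (the features COSAC actually regresses on are not given in this summary), and it assumes an aristocrat-utility-style advantage: an agent's fitted reward share for its chosen action minus the expected share under its own policy. The names fit_additive_rewards and counterfactual_advantages are hypothetical.

```python
import numpy as np

def fit_additive_rewards(actions, rewards, n_actions, lam=1e-2):
    """Ridge-fit a per-agent additive table r_i(a) so that sum_i r_i(a_i) ~ team reward.

    actions: (n_samples, n_agents) integer actions (assumed discrete).
    rewards: (n_samples,) shared team reward.
    Returns an (n_agents, n_actions) table of fitted per-agent reward shares.
    """
    n_samples, n_agents = actions.shape
    # Assumed features: one one-hot block per agent, concatenated.
    X = np.zeros((n_samples, n_agents * n_actions))
    cols = np.arange(n_agents) * n_actions + actions
    X[np.arange(n_samples)[:, None], cols] = 1.0
    # Center the rewards so no intercept term is needed.
    y = rewards - rewards.mean()
    # Closed-form ridge solution: (X^T X + lam I) w = X^T y.
    A = X.T @ X + lam * np.eye(X.shape[1])
    w = np.linalg.solve(A, X.T @ y)
    return w.reshape(n_agents, n_actions)

def counterfactual_advantages(table, actions, policies):
    """Per-agent advantage: r_i(a_i) - E_{a ~ pi_i}[r_i(a)].

    policies: (n_agents, n_actions) action probabilities for each agent.
    Returns an (n_samples, n_agents) array of advantages.
    """
    baseline = (policies * table).sum(axis=1)             # expected share per agent
    chosen = table[np.arange(table.shape[0]), actions]    # share of the action taken
    return chosen - baseline
```

Because each agent's baseline depends only on its own policy and its own fitted share, the advantage in this sketch needs no critic and no extra environment or reward calls, which is the property the summary attributes to COSAC.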
Findings
COSAC achieves the lowest advantage MSE in sequential bandits.
It demonstrates faster convergence than critic-free baselines on the ARC task.
COSAC scales effectively to teams of up to 16 agents.
Abstract
In cooperative teams where agents act in a fixed order and share a single team-level reward (multi-agent language systems, sequential robotic tasks), per-agent credit assignment is under-determined. Critic-based approaches scale poorly as the number of agents grows owing to the costly maintenance of joint/factored critic(s), whereas the existing critic-free alternatives have other issues: common credit across agents that couples every agent's signal to teammate noise, importance-sampling corrections for upstream-update staleness that incur variance exponential in team size, or per-agent counterfactual replay that isolates each agent's effect at the price of extra environment or reward calls. We propose COSAC, a critic-free per-agent policy gradient for sequential cooperative teams. COSAC fits an additive per-agent decomposition of the team reward by a single ridge regression on the…
