TL;DR
This paper explores reinforcement learning for large language model-based multi-agent systems using orchestration traces, focusing on reward design, credit assignment, and decision decomposition, and releases related artifacts.
Contribution
It introduces a structured analysis of RL for LLM multi-agent orchestration, identifying key technical axes and connecting academic methods with industrial evidence.
Findings
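Identifies eight reward families for orchestration tasks, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality.
Maps reward and credit signals onto eight credit- or signal-bearing units, from token to team; explicit counterfactual message-level credit is especially sparse in the curated pool.
Decomposes orchestration learning into five key decisions.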
Abstract
As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Using this lens, we identify three technical axes. First, reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Second, reward and credit signals attach to eight credit- or signal-bearing units from token to team; explicit counterfactual message-level credit remains especially sparse in our curated pool. Third, orchestration…
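The abstract describes an orchestration trace as a temporal interaction graph over events such as spawning, delegation, communication, tool use, return, aggregation, and stopping. The sketch below is one possible way to represent such a trace in Python. It is illustrative only and is not taken from the paper's released artifacts: the class names (`EventType`, `TraceEvent`, `OrchestrationTrace`), the field names, and the agent and tool names in the example are all assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class EventType(Enum):
    # Event kinds listed in the abstract; the enum itself is hypothetical.
    SPAWN = "spawn"
    DELEGATE = "delegate"
    COMMUNICATE = "communicate"
    TOOL_USE = "tool_use"
    RETURN = "return"
    AGGREGATE = "aggregate"
    STOP = "stop"


@dataclass
class TraceEvent:
    step: int                     # position in the trace (temporal order)
    kind: EventType
    source: str                   # agent that emits the event
    target: Optional[str] = None  # receiving agent or tool, if any
    payload: str = ""             # free-form content (task, message, result)


@dataclass
class OrchestrationTrace:
    events: List[TraceEvent] = field(default_factory=list)

    def add(self, kind: EventType, source: str,
            target: Optional[str] = None, payload: str = "") -> TraceEvent:
        # Append an event, stamping it with the next step index.
        event = TraceEvent(len(self.events), kind, source, target, payload)
        self.events.append(event)
        return event

    def edges(self) -> List[Tuple[int, str, str, EventType]]:
        # Directed, time-stamped edges: only events that have a target.
        return [(e.step, e.source, e.target, e.kind)
                for e in self.events if e.target is not None]

    def spawned_agents(self) -> List[str]:
        # Sub-agents in the order they were spawned.
        return [e.target for e in self.events
                if e.kind is EventType.SPAWN and e.target is not None]


# A toy trace with made-up agent and tool names.
trace = OrchestrationTrace()
trace.add(EventType.SPAWN, "orchestrator", "worker_a")
trace.add(EventType.SPAWN, "orchestrator", "worker_b")
trace.add(EventType.DELEGATE, "orchestrator", "worker_a", "summarize section 1")
trace.add(EventType.TOOL_USE, "worker_a", "search", "section 1 keywords")
trace.add(EventType.RETURN, "worker_a", "orchestrator", "summary text")
trace.add(EventType.AGGREGATE, "orchestrator")
trace.add(EventType.STOP, "orchestrator")
```

In a representation like this, a reward or credit signal could in principle be attached to a single event, to one agent's events, or to the trace as a whole, which loosely mirrors the token-to-team range of units mentioned in the abstract.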
