Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning
Hao Ma, Tianyi Hu, Zhiqiang Pu, Boyin Liu, Xiaolin Ai, Yanyan Liang, Min Chen

TL;DR
This paper introduces CORY, a multi-agent reinforcement learning framework for fine-tuning large language models that improves performance and training robustness and reduces distribution collapse compared with PPO-based fine-tuning.
Contribution
CORY extends RL fine-tuning of LLMs to a cooperative multi-agent setting with role exchange, improving stability and effectiveness over existing methods.
Findings
CORY outperforms PPO in policy optimality.
CORY demonstrates increased resistance to distribution collapse.
CORY shows improved training robustness.
Abstract
Reinforcement learning (RL) has emerged as a pivotal technique for fine-tuning large language models (LLMs) on specific tasks. However, prevailing RL fine-tuning methods predominantly rely on PPO and its variants. Though these algorithms are effective in general RL settings, they often exhibit suboptimal performance and vulnerability to distribution collapse when applied to the fine-tuning of LLMs. In this paper, we propose CORY, extending the RL fine-tuning of LLMs to a sequential cooperative multi-agent reinforcement learning framework, to leverage the inherent coevolution and emergent capabilities of multi-agent systems. In CORY, the LLM to be fine-tuned is initially duplicated into two autonomous agents: a pioneer and an observer. The pioneer generates responses based on queries, while the observer generates responses using both the queries and the pioneer's responses. The two…
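The pioneer/observer data flow described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Pair` class, the `make_stub` helper, the stub policies, and the way the query and the pioneer's response are joined are all hypothetical, and the reward and training update are deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# A "policy" here is any callable mapping a prompt string to a response string.
# In practice this would wrap an LLM; stubs are used so the sketch runs as-is.
Policy = Callable[[str], str]


@dataclass
class Pair:
    """Two copies of the same model playing the pioneer and observer roles."""
    pioneer: Policy
    observer: Policy

    def respond(self, query: str) -> Tuple[str, str]:
        # The pioneer sees only the query.
        first = self.pioneer(query)
        # The observer sees the query plus the pioneer's response.
        # Joining them with a newline is an assumption for illustration.
        second = self.observer(f"{query}\n{first}")
        return first, second

    def exchange_roles(self) -> None:
        # Role exchange: the two agents swap positions.
        # When and how often this happens is not specified here.
        self.pioneer, self.observer = self.observer, self.pioneer


def make_stub(name: str) -> Policy:
    """Hypothetical stand-in for an LLM: echoes its name and the prompt."""
    return lambda prompt: f"{name}({prompt})"


pair = Pair(pioneer=make_stub("A"), observer=make_stub("B"))
before = pair.respond("q")   # A acts as pioneer, B as observer
pair.exchange_roles()
after = pair.respond("q")    # B acts as pioneer, A as observer
```

The stubs make the information flow visible: in each call the second response is conditioned on the first, and after `exchange_roles` the agent that previously observed now generates from the query alone.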
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Reinforcement Learning in Robotics · Digital Rights Management and Security
Methods: Attention Is All You Need · Residual Connection · Attention Dropout · Discriminative Fine-Tuning · Linear Layer · Weight Decay · Cosine Annealing · Dropout · Byte Pair Encoding
