Reinforce LLM Reasoning through Multi-Agent Reflection
Yurun Yuan, Tengyang Xie

TL;DR
This paper introduces DPSDP, a reinforcement learning approach that models multi-agent answer refinement as a Markov Decision Process, leading to improved reasoning accuracy of large language models through iterative feedback and collaboration.
Contribution
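It proposes DPSDP (Direct Policy Search by Dynamic Programming), an RL algorithm that trains a multi-agent LLM system to refine answers iteratively. It targets two limitations of existing verify-and-improve approaches: restricted feedback spaces and the lack of coordinated training across agents.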
Findings
DPSDP improves accuracy on in- and out-of-distribution benchmarks.
Majority voting over five refinement steps raises accuracy from 58.2% at the first turn to 63.2%.
Ablation studies confirm the benefits of multi-agent collaboration.
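
The inference-time procedure behind the majority-voting result can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Actor` and `Critic` callables, the `refine` and `majority_vote` helpers, and the reading of "five steps" as five answers in total (one initial answer plus four refinements) are all assumptions made here.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical interfaces (not from the paper):
#   Actor:  (problem, previous_answer, feedback) -> answer
#   Critic: (problem, answer) -> feedback
Actor = Callable[[str, Optional[str], Optional[str]], str]
Critic = Callable[[str, str], str]


def refine(problem: str, actor: Actor, critic: Critic, steps: int = 5) -> List[str]:
    """Collect one answer per turn: an initial answer, then steps - 1 refinements."""
    answers = [actor(problem, None, None)]
    for _ in range(steps - 1):
        feedback = critic(problem, answers[-1])
        answers.append(actor(problem, answers[-1], feedback))
    return answers


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


# Toy stand-ins for the two agents, used only to exercise the loop.
def toy_actor(problem: str, previous: Optional[str], feedback: Optional[str]) -> str:
    # Answers "41" on the first turn, then "42" once it has received feedback.
    return "41" if feedback is None else "42"


def toy_critic(problem: str, answer: str) -> str:
    return "looks right" if answer == "42" else "check the arithmetic"


answers = refine("6 * 7 = ?", toy_actor, toy_critic, steps=5)
final = majority_vote(answers)
```

Voting over every turn, rather than keeping only the last answer, means a single bad refinement cannot overwrite an answer that most turns agree on.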
Abstract
Leveraging more test-time computation has proven to be an effective way to boost the reasoning capabilities of large language models (LLMs). Among various methods, the verify-and-improve paradigm stands out for enabling dynamic solution exploration and feedback incorporation. However, existing approaches often suffer from restricted feedback spaces and lack of coordinated training of different parties, leading to suboptimal performance. To address this, we model this multi-turn refinement process as a Markov Decision Process and introduce DPSDP (Direct Policy Search by Dynamic Programming), a reinforcement learning algorithm that trains an actor-critic LLM system to iteratively refine answers via direct preference learning on self-generated data. Theoretically, DPSDP can match the performance of any policy within the training distribution. Empirically, we instantiate DPSDP with various…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
Methods: Balanced Selection
