Mars-PO: Multi-Agent Reasoning System Preference Optimization
Xiaoxuan Lou, Chaojie Wang, Bo An

TL;DR
Mars-PO is a multi-agent framework that enhances large language models' mathematical reasoning by combining outputs from multiple agents to create robust training data, significantly improving benchmark performance.
Contribution
The paper introduces Mars-PO, a novel multi-agent reasoning system that constructs preference pairs for training, leading to substantial improvements in mathematical reasoning accuracy of LLMs.
Findings
Increases Llama3.1-8B-Instruct accuracy on the MATH benchmark from 50.38% to 57.82%, an absolute gain of 7.44 percentage points.
Outperforms supervised fine-tuning and DPO baselines.
Demonstrates consistent performance gains across benchmarks.
Abstract
Mathematical reasoning is a fundamental capability for large language models (LLMs), yet achieving high performance in this domain remains a significant challenge. The auto-regressive generation process often makes LLMs susceptible to errors, hallucinations, and inconsistencies, particularly during multi-step reasoning. In this paper, we propose Mars-PO, a novel framework to improve the mathematical reasoning capabilities of LLMs through a multi-agent system. It combines high-quality outputs from multiple agents into a hybrid positive sample set and pairs them with agent-specific negative samples to construct robust preference pairs for training. By aligning agents with shared positive samples while addressing individual weaknesses, Mars-PO achieves substantial performance improvements on mathematical reasoning benchmarks. For example, it increases the accuracy on the MATH benchmark of…
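The pairing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The `Sample` record, the `is_correct` label, the `score` field, and the `top_k` cutoff are all hypothetical stand-ins for whatever correctness check and quality measure a real pipeline would use. The sketch pools the best correct responses from all agents into one hybrid positive set per problem, then pairs that shared set with each agent's own incorrect responses.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Sample:
    agent: str        # which agent produced this response
    prompt: str       # the math problem
    response: str     # the generated solution
    is_correct: bool  # assumed label, e.g. from checking the final answer
    score: float      # assumed quality score (hypothetical)


def build_preference_pairs(samples, top_k=2):
    """Return {agent: [(prompt, chosen, rejected), ...]}."""
    prompts = sorted({s.prompt for s in samples})
    agents = sorted({s.agent for s in samples})
    pairs = {agent: [] for agent in agents}
    for prompt in prompts:
        group = [s for s in samples if s.prompt == prompt]
        # Hybrid positive set: top-scoring correct responses pooled across agents.
        positives = sorted(
            (s for s in group if s.is_correct),
            key=lambda s: s.score,
            reverse=True,
        )[:top_k]
        for agent in agents:
            # Agent-specific negatives: only this agent's own incorrect responses.
            negatives = [
                s for s in group if s.agent == agent and not s.is_correct
            ]
            for pos, neg in product(positives, negatives):
                pairs[agent].append((prompt, pos.response, neg.response))
    return pairs


# Toy data with two hypothetical agents, "A" and "B".
samples = [
    Sample("A", "2+2?", "4", True, 0.9),
    Sample("A", "2+2?", "5", False, 0.2),
    Sample("B", "2+2?", "four, so 4", True, 0.7),
    Sample("B", "2+2?", "22", False, 0.1),
]
pairs = build_preference_pairs(samples)
for agent, agent_pairs in pairs.items():
    print(agent, agent_pairs)
```

In this toy run both agents share the same two chosen responses, while the rejected response in each pair comes only from that agent's own mistakes. Each agent would then be trained on its own list of pairs with a preference-optimization objective.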
Taxonomy
Topics: Distributed systems and fault tolerance · Spacecraft Design and Technology
Methods: Sparse Evolutionary Training · Direct Preference Optimization
