Mars-PO: Multi-Agent Reasoning System Preference Optimization

Xiaoxuan Lou; Chaojie Wang; Bo An

arXiv:2411.19039·cs.AI·December 2, 2024

TL;DR

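Mars-PO is a multi-agent framework for improving the mathematical reasoning of large language models. It pools high-quality outputs from multiple agents into a shared set of positive samples, pairs them with each agent's own negative samples, and trains on the resulting preference pairs, which yields substantial gains on mathematical reasoning benchmarks.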

Contribution

The paper introduces Mars-PO, a novel multi-agent reasoning system that constructs preference pairs for training, leading to substantial improvements in mathematical reasoning accuracy of LLMs.

Findings

01. Increases the accuracy of Llama3.1-8B-Instruct on the MATH benchmark from 50.38% to 57.82%.

02. Outperforms supervised fine-tuning and DPO baselines.

03. Demonstrates consistent performance gains across mathematical reasoning benchmarks.
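
For reference, the DPO baseline named in the findings optimizes the standard Direct Preference Optimization objective. The symbols below follow the usual DPO notation and are not taken from this page: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ a frozen reference model, $(x, y_w, y_l)$ a prompt with its preferred and rejected responses, $\beta$ a temperature, and $\sigma$ the logistic function. Whether Mars-PO trains with exactly this loss or a variant of it is not stated here.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```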

Abstract

Mathematical reasoning is a fundamental capability for large language models (LLMs), yet achieving high performance in this domain remains a significant challenge. The auto-regressive generation process often makes LLMs susceptible to errors, hallucinations, and inconsistencies, particularly during multi-step reasoning. In this paper, we propose Mars-PO, a novel framework to improve the mathematical reasoning capabilities of LLMs through a multi-agent system. It combines high-quality outputs from multiple agents into a hybrid positive sample set and pairs them with agent-specific negative samples to construct robust preference pairs for training. By aligning agents with shared positive samples while addressing individual weaknesses, Mars-PO achieves substantial performance improvements on mathematical reasoning benchmarks. For example, it increases the accuracy on the MATH benchmark of…
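
The pairing scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Sample` type, the `correct` label, the `score` field, and the `top_k` cutoff are all hypothetical, since the abstract does not say how "high-quality" outputs are selected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One response from one agent (hypothetical structure)."""
    agent: str
    prompt: str
    response: str
    correct: bool   # assumed: checked against a reference answer
    score: float    # assumed: some quality score, e.g. from a reward model


def build_preference_pairs(samples, top_k=2):
    """Return {agent: [(prompt, chosen, rejected), ...]}.

    Positives are pooled across ALL agents (the hybrid positive set);
    negatives stay specific to the agent that produced them.
    """
    by_prompt = {}
    for s in samples:
        by_prompt.setdefault(s.prompt, []).append(s)

    pairs = {}
    for prompt, group in by_prompt.items():
        # Hybrid positive set: best correct responses from any agent.
        positives = sorted(
            (s for s in group if s.correct),
            key=lambda s: s.score,
            reverse=True,
        )[:top_k]
        if not positives:
            continue  # no usable positive for this prompt
        for s in group:
            if s.correct:
                continue
            # Pair each agent-specific negative with every shared positive.
            for pos in positives:
                pairs.setdefault(s.agent, []).append(
                    (prompt, pos.response, s.response)
                )
    return pairs
```

Each agent would then be trained on its own list of pairs, so all agents are pulled toward the same shared positives while each one is pushed away from its own mistakes.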

Taxonomy

Topics: Distributed systems and fault tolerance · Spacecraft Design and Technology

Methods: Sparse Evolutionary Training · Direct Preference Optimization