Mixture-of-Agents Enhances Large Language Model Capabilities

Junlin Wang; Jue Wang; Ben Athiwaratkun; Ce Zhang; James Zou

arXiv:2406.04692·cs.CL·June 10, 2024·23 cites

Open Access 3 Repos 3 Reviews

TL;DR

This paper introduces a Mixture-of-Agents (MoA) architecture that combines multiple large language models in layered structures, significantly improving performance on various benchmarks over existing models like GPT-4 Omni.

Contribution

The paper proposes a novel layered MoA architecture that leverages multiple LLMs collectively, achieving state-of-the-art results with open-source models.

Findings

01

MoA surpasses GPT-4 Omni on key benchmarks.

02

Open-source LLMs with MoA outperform proprietary models.

03

MoA achieves 65.1% on AlpacaEval 2.0, outperforming previous methods.

Abstract

Recent advances in large language models (LLMs) demonstrate substantial capabilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response. MoA models achieve state-of-the-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% by…
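
The layered flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Agent` type, `build_prompt`, `run_moa`, and the prompt wording are all hypothetical names chosen here, and each "agent" is a plain Python callable standing in for an LLM call. Ending with a single-agent layer to produce one final answer is also an assumption of this sketch.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for an LLM: any callable mapping a prompt to a response.
Agent = Callable[[str], str]


def build_prompt(query: str, previous: Sequence[str]) -> str:
    """Attach the previous layer's responses to the query as auxiliary information."""
    if not previous:
        # First layer: agents see only the original query.
        return query
    numbered = "\n".join(f"{i}. {resp}" for i, resp in enumerate(previous, start=1))
    # The wording of this template is an assumption, not taken from the paper.
    return (
        "Responses from the previous layer:\n"
        f"{numbered}\n\n"
        f"Using these as auxiliary information, answer:\n{query}"
    )


def run_moa(query: str, layers: Sequence[Sequence[Agent]]) -> List[str]:
    """Run the query through each layer in order and return the last layer's outputs."""
    previous: List[str] = []
    for layer in layers:
        prompt = build_prompt(query, previous)
        # Every agent in a layer receives all outputs of the previous layer.
        previous = [agent(prompt) for agent in layer]
    return previous


if __name__ == "__main__":
    def make_stub(name: str) -> Agent:
        # Stub agent that reports its name and the size of the prompt it received.
        return lambda prompt: f"[{name}] saw a prompt of {len(prompt)} characters"

    layers = [
        [make_stub("model-a"), make_stub("model-b"), make_stub("model-c")],
        [make_stub("model-a"), make_stub("model-b"), make_stub("model-c")],
        [make_stub("final")],  # assumed single-agent last layer
    ]
    print(run_moa("What is a mixture of agents?", layers)[0])
```

Keeping agents as plain callables means the same loop works with any model client; only the stub factory would need to be replaced by real calls.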

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. To the best of my knowledge, the proposed framework is both novel and reasonable. MoA can be viewed as a specific method for combining multiple weaker models to create a stronger model.
2. The model's performance is competitive, showing improvements over GPT-4 Omni on three benchmarks.

Weaknesses

1. The proposed model is more resource-intensive than single LLM-based models.
2. Most evaluations use LC metrics, with only a limited evaluation on MATH tasks included in the appendix. Further evaluations on diverse tasks are necessary to illustrate the general advantages of the proposed method.
3. An important question is how to select the set of proposal LLMs. Currently, the paper demonstrates two setups: one with relatively large models and one with smaller models. However, there is n

Reviewer 02 · Rating 8 · Confidence 4

Strengths

+ Proposal of a new effective framework to employ the collective intelligence of multiple LLMs.
+ Empirical evaluation on AlpacaEval 2.0, Arena-Hard, and MT-Bench verifies the effectiveness of the proposed solution.

Weaknesses

+ Stacking LLMs into layers and revising the output obtained from previous layers seems like another form of model ensemble and I would suggest including model ensemble as one of the comparative methods.
+ In Figure 6, the max number of tflops among proposers in each MoA layer is used as an approximation of the total tflops of the entire layers since different proposers can run in a parallel way. However, the approximation is only reasonable when considering the inference latency for a single qu
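
The distinction the reviewer raises can be illustrated with a small calculation. This is one reading of the comment, not the paper's budget analysis: the function names and the per-proposer tflops values below are hypothetical. Taking the maximum within each layer tracks latency when proposers run in parallel, while summing every proposer tracks the compute actually spent.

```python
from typing import Sequence


def latency_proxy_tflops(layers: Sequence[Sequence[float]]) -> float:
    """Sum over layers of the largest per-proposer cost (proposers run in parallel)."""
    return sum(max(layer) for layer in layers)


def total_tflops(layers: Sequence[Sequence[float]]) -> float:
    """Sum of every proposer's cost across all layers (total compute spent)."""
    return sum(sum(layer) for layer in layers)


if __name__ == "__main__":
    # Hypothetical per-proposer costs for three layers; not figures from the paper.
    layers = [[10.0, 20.0, 30.0], [10.0, 20.0, 30.0], [30.0]]
    print(latency_proxy_tflops(layers))  # 30 + 30 + 30
    print(total_tflops(layers))          # 60 + 60 + 30
```

With these made-up values the parallel proxy undercounts total compute whenever a layer has more than one proposer, which is why the two quantities answer different questions.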

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- While collaborativeness has been harnessed in various ways, a layered funnel architecture in which earlier layers add information for later layers to consume and interplay to summarize these outputs efficiently to yield a final output has not been explored.
- The authors also thoroughly conducted their experiments to establish collaborativeness and benchmark various datasets. The usage of open-source models to showcase the results helps to make replicating these results possible.
- They al

Weaknesses

- The idea of collaborativeness or hierarchical processing in LLMs is not exactly novel [1][2]; if you think of different layers in the architecture using the same model, this reduces to some form of iterative refinement of outputs as shown in [2].
- Some of the analysis in the paper to support budget analysis is unclear.

References

[1] [2308.10848] AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors
[2] LEGO: A Multi-agent Collaborative Framework with

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer