Training Language Models to Critique With Multi-agent Feedback

Tian Lan; Wenwei Zhang; Chengqi Lyu; Shuaibin Li; Chen Xu; Heyan Huang; Dahua Lin; Xian-Ling Mao; Kai Chen

arXiv:2410.15287·cs.CL·October 22, 2024

Open Access·3 Reviews

TL;DR

This paper introduces MultiCritique, a multi-agent feedback pipeline that enhances LLMs' critique ability by aggregating high-quality critiques from multiple agents, leading to significant performance improvements over existing models.

Contribution

The paper presents a novel multi-agent feedback data generation pipeline, MultiCritique, improving critique quality and LLM performance beyond single-agent approaches.

Findings

01

Constructed a superior critique dataset compared to existing ones.

02

Enhanced critique ability of LLMs through multi-agent feedback and RL.

03

Fine-tuned 7B model approaches 70B LLM performance.

Abstract

Critique ability, a meta-cognitive capability of humans, presents significant challenges for LLMs to improve. Recent works primarily rely on supervised fine-tuning (SFT) using critiques generated by a single LLM like GPT-4. However, these model-generated critiques often exhibit flaws due to the inherent complexity of the critique. Consequently, fine-tuning LLMs on such flawed critiques typically limits the model's performance and propagates these flaws into the learned model. To overcome these challenges, this paper proposes a novel data generation pipeline, named MultiCritique, that improves the critique ability of LLMs by utilizing multi-agent feedback in both the SFT and reinforcement learning (RL) stages. First, our data generation pipeline aggregates high-quality critiques from multiple agents instead of a single model, with crucial information as input for simplifying the…
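The idea of aggregating critiques from several agents, rather than trusting a single model, can be illustrated with a small sketch. This is not the paper's pipeline: the abstract does not spell out the aggregation step, so a plain agreement threshold is swapped in here as a stand-in. The function name `aggregate_critiques`, the `min_votes` parameter, and the agent names and flaw strings are all hypothetical.

```python
from collections import Counter


def aggregate_critiques(agent_critiques, min_votes=2):
    """Keep flaws that at least `min_votes` agents reported.

    Assumption: each agent's critique is already reduced to a list of
    short flaw descriptions. Real critiques are free-form text and would
    need a matching step far more careful than string normalization.
    """
    votes = Counter()
    for flaws in agent_critiques.values():
        # Count each flaw once per agent, ignoring case and outer spaces.
        votes.update({flaw.strip().lower() for flaw in flaws})
    return sorted(flaw for flaw, count in votes.items() if count >= min_votes)


# Hypothetical critiques of one response from three agents.
critiques = {
    "agent_a": ["Wrong final answer", "Missing units"],
    "agent_b": ["wrong final answer", "Too verbose"],
    "agent_c": ["Wrong final answer ", "missing units"],
}

print(aggregate_critiques(critiques))
```

With the default threshold, a flaw raised by only one agent ("too verbose") is dropped, while flaws that two or more agents agree on are kept. Raising `min_votes` trades coverage for stricter agreement.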

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 8·Confidence 2

Strengths

- The paper is well-written and effectively presented, with technical details that are clearly explained and easy to follow. Figure 1 is particularly effective in illustrating the entire MultiCritique pipeline.
- The evaluations and ablation studies are comprehensive, providing a thorough analysis of the model's performance and the impact of various components.
- The MultiCritique SFT dataset, if made open-source, could be a good contribution to the field due to its careful construction aimed at

Weaknesses

- Dependence on GPT-4: The project heavily relies on GPT-4 for classifying and summarizing critiques, which may introduce biases into the final critique summaries. It is not hard to say that the entire project is feasible because of the existence of GPT-4. This reliance potentially undermines the benefits of using multiple agents in earlier steps. I know it might be a bit too much to ask, but it would be interesting to explore the impact of using a different model, such as Claude 3.5 Sonnet, to

Reviewer 02·Rating 6·Confidence 3

Strengths

- It is a practically useful work.
- Good improvement in results on the CriticEval and CriticBench benchmarks.

Weaknesses

- Mostly, this paper feels like a formalisation of distilling “critiquing” ability from GPT-4 by generating critique data for both SFT and RL tuning. The data generation process is also quite heavily dependent on GPT-4's abilities.
- The novelty of this work is not very clear to me.

Reviewer 03·Rating 6·Confidence 3

Strengths

- Overall, addresses an important problem: developing a small, high-quality critique model.
- New multi-agent feedback approach to curating the dataset.
- Creation of a new high-quality multi-critique dataset, incorporating diverse queries and crucial information collection.
- Natural integration and use of RL.
- Strong experimental validation, showing 7B FT models perform on par with or better than closed-source and 70B models on CriticEval and CriticBench.

Weaknesses

- Not very technically novel; it looks like a fairly straightforward application of the ensembling principle to dataset collection.
- Performance is still limited by the upper bound of the base LLMs, and it is not clear whether performance would be better if the approach were scaled to 70B.

Taxonomy

Topics: Multi-Agent Systems and Negotiation

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Multi-Head Attention · Adam · Dropout