Training Language Models to Critique With Multi-agent Feedback
Tian Lan, Wenwei Zhang, Chengqi Lyu, Shuaibin Li, Chen Xu, Heyan Huang, Dahua Lin, Xian-Ling Mao, Kai Chen

TL;DR
This paper introduces MultiCritique, a multi-agent feedback pipeline that enhances LLMs' critique ability by aggregating high-quality critiques from multiple agents, leading to significant performance improvements over existing models.
Contribution
The paper presents a novel multi-agent feedback data generation pipeline, MultiCritique, improving critique quality and LLM performance beyond single-agent approaches.
Findings
Constructed a superior critique dataset compared to existing ones.
Enhanced critique ability of LLMs through multi-agent feedback and RL.
Fine-tuned 7B model approaches 70B LLM performance.
Abstract
Critique ability, a meta-cognitive capability of humans, presents significant challenges for LLMs to improve. Recent works primarily rely on supervised fine-tuning (SFT) using critiques generated by a single LLM like GPT-4. However, these model-generated critiques often exhibit flaws due to the inherent complexity of the critique. Consequently, fine-tuning LLMs on such flawed critiques typically limits the model's performance and propagates these flaws into the learned model. To overcome these challenges, this paper proposes a novel data generation pipeline, named MultiCritique, that improves the critique ability of LLMs by utilizing multi-agent feedback in both the SFT and reinforcement learning (RL) stages. First, our data generation pipeline aggregates high-quality critiques from multiple agents instead of a single model, with crucial information as input for simplifying the…
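As a rough illustration of the idea of aggregating critiques from several agents rather than trusting a single model, the sketch below keeps only the flaws that enough agents agree on. It is a toy stand-in, not the paper's pipeline: the function name `aggregate_critiques`, the flaw labels, the agent names, and the simple vote threshold `min_votes` are all assumptions made for illustration. According to the reviews, the actual pipeline relies on an LLM to classify and summarize critiques.

```python
from collections import Counter
from typing import Dict, List


def aggregate_critiques(
    agent_critiques: Dict[str, List[str]], min_votes: int = 2
) -> List[str]:
    """Keep flaw labels raised by at least `min_votes` distinct agents.

    Hypothetical helper: simple voting stands in for the LLM-based
    classification and summarization step used in the real pipeline.
    """
    votes: Counter = Counter()
    for flaws in agent_critiques.values():
        # Deduplicate per agent so one agent cannot vote twice for a flaw.
        votes.update(set(flaws))
    return sorted(flaw for flaw, n in votes.items() if n >= min_votes)


# Made-up critiques of one response from three hypothetical agents.
critiques = {
    "agent_a": ["wrong_arithmetic", "missing_units"],
    "agent_b": ["wrong_arithmetic"],
    "agent_c": ["missing_units", "verbose", "wrong_arithmetic"],
}
print(aggregate_critiques(critiques))
```

Raising `min_votes` trades recall for precision: fewer flaws survive, but each one has broader agreement behind it.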
Peer Reviews
Decision: Submitted to ICLR 2025
- The paper is well-written and effectively presented, with technical details that are clearly explained and easy to follow. Figure 1 is particularly effective in illustrating the entire MultiCritique pipeline.
- The evaluations and ablation studies are comprehensive, providing a thorough analysis of the model's performance and the impact of various components.
- The MultiCritique SFT dataset, if made open-source, could be a good contribution to the field due to its careful construction aimed at
- Dependence on GPT-4: The project heavily relies on GPT-4 for classifying and summarizing critiques, which may introduce biases into the final critique summaries. It is fair to say that the entire project is feasible because GPT-4 exists. This reliance potentially undermines the benefits of using multiple agents in earlier steps. I know it might be a bit too much to ask, but it would be interesting to explore the impact of using a different model, such as Claude 3.5 Sonnet, to
- It is a practically useful work
- Good improvement in results on the CriticEval and CriticBench benchmarks
- Mostly, this paper feels like a formalisation of distilling “critiquing” ability from GPT-4 by generating critique data for both SFT and RL tuning. The data generation process is also heavily dependent on GPT-4's abilities.
- The novelty of this work is not very clear to me
- Overall, addresses an important problem: developing a small, high-quality critique model
- New multi-agent feedback approach to curate the dataset
- Creation of a new high-quality multi-critique dataset, incorporating diverse queries and crucial information collection
- Natural integration and use of RL
- Strong experimental validation, showing that 7B fine-tuned models perform on par with or better than closed-source and 70B models on CriticEval and CriticBench
- Not very technically novel; looks like a fairly straightforward application of ensembling to dataset collection
- Performance is still limited by the upper bound of the base LLMs, and it is not clear whether performance would improve if scaled to 70B
Taxonomy
Topics: Multi-Agent Systems and Negotiation
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Multi-Head Attention · Adam · Dropout
