When Disagreements Elicit Robustness: Investigating Self-Repair Capabilities under LLM Multi-Agent Disagreements

Tianjie Ju; Bowen Wang; Hao Fei; Mong-Li Lee; Wynne Hsu; Yun Li; Qianren Wang; Pengzhou Cheng; Zongru Wu; Haodong Zhao; Zhuosheng Zhang; Gongshen Liu

arXiv:2502.15153·cs.CL·October 3, 2025

TL;DR

This paper explores how disagreements among large language model agents can enhance robustness and self-repair in collaborative tasks, revealing that disagreement type influences success and solution flexibility.

Contribution

It demonstrates that general disagreements promote exploration and robustness, while task-critical disagreements can hinder reasoning but have limited impact on programming tasks.

Findings

01. General disagreements improve success through exploration.
02. Task-critical disagreements reduce success in reasoning tasks.
03. Agents often bypass edited facts in programming, enabling self-repair.
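As an illustration of Finding 03, the sketch below shows why a programming task can tolerate a disputed building block: the same behavior is reachable through more than one implementation. It is loosely inspired by the reviewers' mention of an example in which the system avoids an edited `append()`. The function names and the task itself are hypothetical and are not taken from the paper.

```python
# Hypothetical task: return the even numbers from a list, preserving order.
# Two independent implementations illustrate a "multi-path" solution space.

def collect_evens_with_append(numbers):
    """Route A: builds the result with list.append()."""
    result = []
    for n in numbers:
        if n % 2 == 0:
            result.append(n)
    return result


def collect_evens_without_append(numbers):
    """Route B: avoids list.append() entirely via a list comprehension."""
    return [n for n in numbers if n % 2 == 0]


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5, 6]
    # Both routes agree, so distrust of one primitive need not block the task.
    print(collect_evens_with_append(sample) == collect_evens_without_append(sample))
```

If agents disagree about how `append()` behaves, Route B still satisfies the specification. A single-chain factual question has no comparable alternative route, which is consistent with the contrast drawn in Findings 02 and 03.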

Abstract

Recent advances in Large Language Models (LLMs) have upgraded them from sophisticated text generators to autonomous agents capable of cooperation and tool use in multi-agent systems (MAS). However, it remains unclear how disagreements shape collective decision-making. In this paper, we revisit the role of disagreement and argue that general, partially overlapping disagreements prevent premature consensus and expand the explored solution space, while disagreements on task-critical steps can derail collaboration depending on the topology of solution paths. We investigate two collaborative settings with distinct path structures: collaborative reasoning (CounterFact, MQuAKE-cf), which typically follows a single evidential chain, and collaborative programming (HumanEval, GAIA), which often admits multiple valid implementations. Disagreements are instantiated as general heterogeneity among…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The distinction between general disagreements (beneficial diversity) and task-critical disagreements (potentially harmful conflicts) is well-motivated and provides a useful lens for understanding MAS robustness.
- The use of knowledge editing methods (IKE, ROME, MEND) to inject controlled disagreements is novel and allows for reproducible experiments.
- The paper tests multiple models (LLaMA, Qwen, InternLM) across multiple datasets with multiple metrics, providing reasonable breadth.

Weaknesses

The paper uses AutoGen for collaborative programming but doesn't compare against ChatDev [1] or MetaGPT [2], which are established frameworks specifically designed for multi-agent software development. These frameworks have been shown to significantly outperform simpler multi-agent setups and would be natural baselines. For instance, ChatDev achieves quality scores of 0.3953 compared to 0.1523 for MetaGPT on software development benchmarks through its cooperative communication method [1]. The

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper effectively operationalizes its hypotheses by contrasting collaborative reasoning with collaborative programming. The use of knowledge editing (ROME, MEND, IKE) to inject controlled, task-critical disagreements is robust.
2. The trace analysis for RQ3 provides evidence of the self-repair mechanism the authors hypothesize. The example in Table 6, where the MAS avoids the edited append() function by using a different implementation, clearly demonstrates this rerouting capability in p

Weaknesses

1. The paper frames the task topology as a binary (single-path vs. multi-path). This distinction, while useful for the experiment, may be an oversimplification. Many complex reasoning tasks might admit multiple evidential paths, and some programming tasks may have only one optimal solution.
2. While the paper demonstrates that self-repair occurs, it does not analyze the mechanism of this repair. The trace analysis shows what happens, but not the communicative or reasoning dynamics of how the age

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. This paper presents analyses through a path-aware view of robustness, and the distinction between single-path and multi-path tasks provides useful findings for understanding MAS behavior.
2. The experimental design is rigorous, with controlled manipulation of disagreements through knowledge editing.
3. The findings and analyses have practical implications for MAS design and challenge the assumption that all disagreements are harmful. The self-repair capability is an interesting emergent phe

Weaknesses

**1. Limitations in mechanistic understanding:** While the paper documents self-repair behavior, it provides limited insight into why it occurs.
**2. Limitations in using knowledge editing for disagreements:** The use of knowledge editing, especially parametric methods like ROME, may not faithfully represent naturally occurring disagreements in deployed systems.
**3. Limitations in theoretical grounding:** The formalization in Section 2 is intuitive but lacks rigor. For example, there seems

Code & Models

Repositories

wbw625/multiagentrobustness · pytorch · Official

Taxonomy

Topics: Auction Theory and Applications · Statistical and Computational Modeling · Law, Economics, and Judicial Systems

Methods: Mixing Adam and SGD