DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems

Ming Ma; Jue Zhang; Fangkai Yang; Yu Kang; Qingwei Lin; Saravan Rajmohan; Dongmei Zhang

arXiv:2512.06749·cs.AI·February 3, 2026

Open Access · 3 Reviews

TL;DR

DoVer introduces an intervention-driven debugging framework for LLM multi-agent systems, actively verifying hypotheses through targeted interventions to improve failure resolution and system reliability.

Contribution

The paper presents an intervention-based debugging approach that tests failure hypotheses through targeted edits to the system's run, rather than relying on log analysis alone, in multi-agent LLM systems.

Findings

01 Flips 18-28% of failed trials into successes

02 Achieves up to 16% milestone progress

03 Validates or refutes 30-60% of failure hypotheses

Abstract

Large language model (LLM)-based multi-agent systems are challenging to debug because failures often arise from long, branching interaction traces. The prevailing practice is to leverage LLMs for log-based failure localization, attributing errors to a specific agent and step. However, this paradigm has two key limitations: (i) log-only debugging lacks validation, producing untested hypotheses, and (ii) single-step or single-agent attribution is often ill-posed, as we find that multiple distinct interventions can independently repair the failed task. To address the first limitation, we introduce DoVer, an intervention-driven debugging framework, which augments hypothesis generation with active verification through targeted interventions (e.g., editing messages, altering plans). For the second limitation, rather than evaluating on attribution accuracy, we focus on measuring whether the…
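The hypothesize, intervene, and verify loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Hypothesis` type, the `intervene` and `verify` helpers, and the toy replay and success functions are all hypothetical names introduced here for illustration. The sketch also assumes that a trace is a flat list of messages and that a re-run which succeeds validates the hypothesis, which simplifies whatever criteria the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical record: which step is suspected and what to replace it with."""
    step: int
    replacement: str


def intervene(trace: List[str], hyp: Hypothesis) -> List[str]:
    """Keep the trace before the suspected step, then substitute the edited message."""
    return trace[:hyp.step] + [hyp.replacement]


def verify(
    trace: List[str],
    hypotheses: List[Hypothesis],
    replay: Callable[[List[str]], List[str]],
    is_success: Callable[[List[str]], bool],
) -> Dict[Hypothesis, bool]:
    """Re-run the system from each edited prefix and record whether the task succeeds."""
    results = {}
    for hyp in hypotheses:
        rerun = replay(intervene(trace, hyp))
        results[hyp] = is_success(rerun)
    return results


# Toy stand-ins (assumptions): a real system would resume agents from a checkpoint.
def toy_replay(prefix: List[str]) -> List[str]:
    tail = "answer: 42" if "search docs" in prefix[-1] else "answer: unknown"
    return prefix + [tail]


def toy_is_success(trace: List[str]) -> bool:
    return trace[-1] == "answer: 42"


failed_trace = ["plan: guess the answer", "act: guess", "answer: unknown"]
hypotheses = [
    Hypothesis(step=0, replacement="plan: search docs"),  # alter the plan
    Hypothesis(step=1, replacement="act: guess again"),   # edit a later message
]

results = verify(failed_trace, hypotheses, toy_replay, toy_is_success)
flipped = any(results.values())  # the failed trial counts as flipped if any re-run succeeds
```

In this toy run only the plan edit leads to a successful re-run, so the first hypothesis is marked as validated and the trial counts as flipped. Measuring outcomes this way, rather than checking whether a single attributed step matches a label, mirrors the abstract's point that several distinct interventions may each repair the same failure.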

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 5

Strengths

The in-depth examination of the Who&When dataset in Section 3 provides valuable insights into the system’s behavior, helping clarify when and why failures occur. The framework’s ability to transform failed trials into useful learning opportunities demonstrates its robustness and practical value for improving multi-agent reliability.

Weaknesses

The evaluation relies solely on a hand-crafted subset of the Who&When dataset, which limits the generalizability of the findings and raises questions about how well DoVer would perform on larger or more diverse real-world datasets. The framework is evaluated only with GPT-4o as the backend model, which restricts understanding of DoVer’s effectiveness across different LLM architectures and limits claims about its general applicability.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The concrete and detailed motivating analysis of log-based failure attribution in Section 3 is quite informative in exposing how existing benchmarks and metrics could be insufficient and/or inaccurate for evaluating agent failure attribution.

Weaknesses

My major concern is that the current experiments include no baseline, which makes it difficult to interpret how good the numbers in Table 2, and hence the proposed method, are. Specifically, the proposed intervention-based debugging system is essentially doing self-refinement in some sense. Therefore, I believe reasonable baselines could include self-improvement techniques such as Self-Refine (Madaan et al., 2023) and CRITIC (Gou et al., 2024) which provide feedback to the agentic system to impro

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Originality: a creative shift from log-only evaluation to an active debugging paradigm.
2. The finding that ground-truth failure labels are inherently ambiguous in multi-agent settings is insightful and valuable, e.g., “multiple trials per session and inter-agent misalignment make single-step annotation ill-posed”.
3. The trial segmentation idea and the use of checkpoint replay are practical and generalizable.
4. Instead of just pointing at logs, they actually interven

Weaknesses

1. Although they use realistic benchmarks, the dataset is not large (~100 cases, ~200 trials). I worry behavior may differ when tasks have interactive state beyond web browsing, or when weaker open-source agents are used (GPT-4o/5 are strong babysitters).
2. All experiments use Magentic-One + AGDebugger. What about agents from other frameworks? Would trial segmentation generalize?

Taxonomy

Topics: Software System Performance and Reliability · Topic Modeling · Text Readability and Simplification