DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems
Ming Ma, Jue Zhang, Fangkai Yang, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

TL;DR
DoVer introduces an intervention-driven debugging framework for LLM multi-agent systems, actively verifying hypotheses through targeted interventions to improve failure resolution and system reliability.
Contribution
DoVer pairs each failure hypothesis with a targeted intervention that tests it, moving the debugging of LLM multi-agent systems beyond log-only analysis.
Findings
Flips 18-28% of failed trials into successes
Achieves up to 16% milestone progress
Validates or refutes 30-60% of failure hypotheses
Abstract
Large language model (LLM)-based multi-agent systems are challenging to debug because failures often arise from long, branching interaction traces. The prevailing practice is to leverage LLMs for log-based failure localization, attributing errors to a specific agent and step. However, this paradigm has two key limitations: (i) log-only debugging lacks validation, producing untested hypotheses, and (ii) single-step or single-agent attribution is often ill-posed, as we find that multiple distinct interventions can independently repair the failed task. To address the first limitation, we introduce DoVer, an intervention-driven debugging framework, which augments hypothesis generation with active verification through targeted interventions (e.g., editing messages, altering plans). For the second limitation, rather than evaluating on attribution accuracy, we focus on measuring whether the…
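The hypothesize, intervene, and verify loop described in the abstract can be illustrated with a minimal sketch. This is not DoVer's implementation: the names `Hypothesis`, `verify_hypotheses`, `toy_replay`, and `toy_success` are hypothetical, and the replay function is a toy stand-in for re-running real agents from a checkpoint. The sketch only shows the idea of editing a message at a suspected step, replaying from that point, and checking whether the failed task now succeeds.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    step: int          # index of the suspected faulty message in the trace
    replacement: str   # edited message to inject at that step (the intervention)


@dataclass
class Outcome:
    hypothesis: Hypothesis
    success: bool      # True if the replayed run succeeds after the intervention


def verify_hypotheses(
    trace: List[str],
    hypotheses: List[Hypothesis],
    replay: Callable[[List[str]], List[str]],
    is_success: Callable[[List[str]], bool],
) -> List[Outcome]:
    """Test each hypothesis by intervening at its step and replaying the run."""
    outcomes = []
    for h in hypotheses:
        # Keep everything before the suspected step, then inject the edit.
        prefix = trace[: h.step] + [h.replacement]
        new_trace = replay(prefix)
        outcomes.append(Outcome(h, is_success(new_trace)))
    return outcomes


def toy_replay(prefix: List[str]) -> List[str]:
    # Hypothetical stand-in for resuming the agents from a checkpoint.
    used_tool = any("use calculator" in message for message in prefix)
    return prefix + ["answer: 42" if used_tool else "answer: unknown"]


def toy_success(trace: List[str]) -> bool:
    # Hypothetical task check: the run succeeds if it ends with the right answer.
    return trace[-1] == "answer: 42"


failed_trace = ["plan: guess the result", "act: guess", "answer: unknown"]
hypotheses = [
    Hypothesis(step=0, replacement="plan: use calculator"),
    Hypothesis(step=1, replacement="act: guess again"),
]
outcomes = verify_hypotheses(failed_trace, hypotheses, toy_replay, toy_success)
flipped = any(o.success for o in outcomes)
```

In this toy run the first intervention repairs the task and the second does not, so one hypothesis is validated and the other refuted. Running every hypothesis independently, rather than stopping at the first repair, reflects the abstract's observation that several distinct interventions may each fix the same failed task.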
Peer Reviews
Decision: ICLR 2026 Poster
The in-depth examination of the Who&When dataset in Section 3 provides valuable insights into the system’s behavior, helping clarify when and why failures occur. The framework’s ability to transform failed trials into useful learning opportunities demonstrates its robustness and practical value for improving multi-agent reliability.
The evaluation relies solely on a hand-crafted subset of the Who&When dataset, which limits the generalizability of the findings and raises questions about how well DoVer would perform on larger or more diverse real-world datasets. The framework is evaluated only with GPT-4o as the backend model, which restricts understanding of DoVer’s effectiveness across different LLM architectures and limits claims about its general applicability.
The concrete and detailed motivating analysis of log-based failure attribution in Section 3 is quite informative in exposing how existing benchmarks and metrics could be insufficient and/or inaccurate for evaluating agent failure attribution.
My major concern is that the current experiments include no baseline, which makes it difficult to interpret how good the numbers in Table 2, and the proposed method itself, really are. Specifically, the proposed intervention-based debugging system is essentially doing self-refinement in some sense. Therefore, I believe reasonable baselines could include self-improvement techniques such as Self-Refine (Madaan et al., 2023) and CRITIC (Gou et al., 2024) which provide feedback to the agentic system to impro
1. Originality: a creative shift from log-only evaluation to an active debugging paradigm. 2. The finding that ground-truth failure labels are inherently ambiguous in multi-agent settings is insightful and very valuable, e.g., “multiple trials per session and inter-agent misalignment make single-step annotation ill-posed”. 3. The trial segmentation idea and the use of checkpoint replay are practical and generalizable. 4. Instead of just pointing at logs, they actually interven
1. Although they use realistic benchmarks, the dataset is small (~100 cases, ~200 trials). I worry that behavior may differ when tasks have interactive state beyond web browsing, or when weaker open-source agents are used (GPT-4o/5 are strong babysitters). 2. All experiments use Magentic-One + AGDebugger. What about agents from other frameworks? Would trial segmentation generalize?
Taxonomy
Topics: Software System Performance and Reliability · Topic Modeling · Text Readability and Simplification
