Sequential Consensus for Multi-Agent LLM Debates: A Wald-SPRT compute governor with calibration-based failure detection
Andrea Morandi

TL;DR
This paper introduces a Wald-SPRT-based compute governor for multi-agent LLM debates. It stops each debate adaptively, which reduces computational cost while maintaining high accuracy.
Contribution
It adapts Wald's SPRT as a plug-in monitor for LLM debates, providing error guarantees and calibration methods for efficient, adaptive decision-making.
Findings
Average debate rounds reduced by 3.7x on GSM8K
Achieved 97.0% accuracy with fewer calls than fixed-round debate
Calibrated KL divergence collapses, indicating effective stopping rules
Abstract
Multi-agent LLM debate improves factuality and reasoning, but most recipes pick a fixed round count, over-spending on easy items and under-spending on hard ones. We adapt Wald's Sequential Probability Ratio Test (SPRT) as a plug-in compute governor for LLM debates. After each round, an LLM judge emits a [0,1] consensus score on the latest agent positions; a Wald monitor accumulates the log-likelihood ratio of "useful convergence" vs "not yet useful" under a Beta likelihood family, and stops when either boundary is crossed or returns a capped best-effort outcome at R_max. Under i.i.d. assumptions the rule inherits SPRT type-I/type-II error guarantees; in deployment the calibration itself is the more important object, since it estimates whether the judge score actually separates useful from unhelpful convergence in a given domain. We evaluate two tracks: (i) a Monte-Carlo study under…
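The stopping rule described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation. The function names, the Beta parameters for the two hypotheses (Beta(8, 2) for "useful convergence", Beta(2, 2) for "not yet useful"), the error rates, and the score clipping are all assumptions chosen for illustration; in the paper these parameters would come from calibration. The sketch also treats per-round judge scores as i.i.d., which is the condition under which the Wald boundaries carry their nominal error guarantees.

```python
import math


def beta_logpdf(x, a, b):
    """Log density of a Beta(a, b) distribution at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)


def wald_thresholds(alpha, beta):
    """Wald's approximate log-boundaries for type-I error alpha and type-II error beta."""
    upper = math.log((1 - beta) / alpha)  # crossing above accepts H1
    lower = math.log(beta / (1 - alpha))  # crossing below accepts H0
    return lower, upper


def sprt_monitor(scores, h1=(8.0, 2.0), h0=(2.0, 2.0),
                 alpha=0.05, beta=0.05, r_max=10, eps=1e-6):
    """Accumulate the log-likelihood ratio over judge scores, one per debate round.

    h1 and h0 are hypothetical Beta parameters for "useful convergence" and
    "not yet useful"; they are placeholders, not calibrated values.
    Returns (decision, rounds_used, final_llr).
    """
    lower, upper = wald_thresholds(alpha, beta)
    llr = 0.0
    rounds = 0
    for s in scores:
        if rounds >= r_max:
            break
        rounds += 1
        s = min(max(s, eps), 1 - eps)  # keep log() finite at scores of exactly 0 or 1
        llr += beta_logpdf(s, *h1) - beta_logpdf(s, *h0)
        if llr >= upper:
            return "converged", rounds, llr
        if llr <= lower:
            return "not_converged", rounds, llr
    # Neither boundary was crossed within the round budget: best-effort outcome.
    return "capped", rounds, llr


if __name__ == "__main__":
    print(sprt_monitor([0.9, 0.9, 0.9]))      # high consensus scores
    print(sprt_monitor([0.3]))                # low consensus score
    print(sprt_monitor([0.66] * 20, r_max=5)) # ambiguous scores hit the cap
```

With these placeholder parameters, two consecutive scores of 0.9 cross the upper boundary, a single score of 0.3 crosses the lower one, and scores near 0.66 contribute almost nothing to the log-likelihood ratio, so the monitor runs to the round cap. How quickly the rule stops depends on how well the two Beta distributions are separated, which is why the calibration step matters.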
