Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled

TL;DR
Pyramid MoA is a hierarchical Mixture-of-Agents framework for cost-effective anytime inference with large language models. It offers provable monotonicity guarantees and transfers zero-shot to unseen benchmarks.
Contribution
It formalizes the connection between LLM cascading and classical anytime algorithms, proposing a decision-theoretic router with monotonicity guarantees and extending Value of Computation theory.
Findings
Intercepts 81.6% of bugs on MBPP
Nearly matches the 68.1% Oracle baseline on GSM8K/MMLU with up to 42.9% compute savings
Transfers zero-shot to unseen benchmarks with significant cost reductions
Abstract
We observe that LLM cascading and routing implicitly solve an anytime computation problem -- a class of algorithms, well-studied in classical AI, that improve solutions as additional computation is allocated. We formalize this connection and propose Pyramid MoA, a hierarchical Mixture-of-Agents architecture governed by a decision-theoretic router that escalates queries only when necessary. We establish a Probabilistic Anytime Property with provable monotonicity guarantees and derive a generalized escalation rule from Value of Computation theory that accounts for imperfect oracles, extending the Hansen-Zilberstein monitoring framework to stochastic LLM inference. On MBPP, the router intercepts 81.6% of bugs; on GSM8K/MMLU, the system nearly matches the 68.1% Oracle baseline while achieving up to 42.9% compute savings. The router transfers zero-shot to unseen benchmarks: matching Oracle…
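To make the escalation idea concrete, here is a minimal Python sketch of a cost-aware cascade. It is not the paper's router: the `Tier` fields, the `confidence` callback, and the specific rule (escalate when the estimated chance that the current answer is wrong, times the next tier's assumed accuracy, exceeds the next tier's cost) are illustrative assumptions chosen to mirror a Value of Computation style comparison of expected gain against cost.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    """One level of a hypothetical model pyramid (all fields are assumptions)."""
    name: str
    cost: float                     # compute cost per call, in units of answer value
    accuracy: float                 # assumed probability this tier answers correctly
    answer: Callable[[str], str]    # stand-in for a model call


def should_escalate(p_correct: float, next_accuracy: float, next_cost: float,
                    value_of_correct: float = 1.0) -> bool:
    """Escalate only if the expected gain from the next tier exceeds its cost.

    p_correct is the router's (imperfect) estimate that the current answer is right.
    """
    expected_gain = (1.0 - p_correct) * next_accuracy * value_of_correct
    return expected_gain > next_cost


def cascade(query: str, tiers: List[Tier],
            confidence: Callable[[str, str], float]) -> Tuple[str, float]:
    """Run tiers from cheapest to most expensive; return (answer, total cost spent)."""
    spent = 0.0
    answer = ""
    for i, tier in enumerate(tiers):
        answer = tier.answer(query)
        spent += tier.cost
        if i + 1 == len(tiers):
            break  # no larger tier left to escalate to
        nxt = tiers[i + 1]
        if not should_escalate(confidence(query, answer), nxt.accuracy, nxt.cost):
            break  # current answer is good enough for its price
    return answer, spent
```

In this sketch an answer is always available after the first tier, and further computation is spent only when the estimated benefit outweighs its cost, which is the anytime behavior the abstract describes. A real router would have to learn `confidence` from data and account for its own error rate.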
