Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Arindam Khaled

arXiv:2602.19509·cs.CL·April 14, 2026

TL;DR

Pyramid MoA is a hierarchical Mixture-of-Agents framework for cost-aware anytime inference with large language models. A decision-theoretic router escalates a query to a larger tier only when necessary, with provable monotonicity guarantees and zero-shot transfer to unseen benchmarks.

Contribution

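The paper formalizes the connection between LLM cascading and classical anytime algorithms, proposes a decision-theoretic router with provable monotonicity guarantees, and derives a generalized escalation rule from Value of Computation theory that extends the Hansen-Zilberstein monitoring framework to stochastic LLM inference.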

Findings

01  Intercepts 81.6% of bugs on MBPP

02  Nearly matches Oracle accuracy on GSM8K/MMLU with up to 42.9% compute savings

03  Transfers zero-shot to unseen benchmarks with significant cost reductions

Abstract

We observe that LLM cascading and routing implicitly solve an anytime computation problem. Anytime algorithms, well studied in classical AI, improve their solutions as additional computation is allocated. We formalize this connection and propose Pyramid MoA, a hierarchical Mixture-of-Agents architecture governed by a decision-theoretic router that escalates queries only when necessary. We establish a Probabilistic Anytime Property with provable monotonicity guarantees and derive a generalized escalation rule from Value of Computation theory that accounts for imperfect oracles, extending the Hansen-Zilberstein monitoring framework to stochastic LLM inference. On MBPP, the router intercepts 81.6% of bugs; on GSM8K/MMLU, the system nearly matches the 68.1% Oracle baseline while achieving up to 42.9% compute savings. The router transfers zero-shot to unseen benchmarks: matching Oracle…
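To make the escalation idea concrete, here is a minimal Python sketch of a cost-aware cascade. It is not the paper's router. It assumes a generic value-of-computation test: escalate only when the expected accuracy gain from the next tier exceeds its weighted cost. The names `Tier`, `fix_rate`, `break_rate`, `cost_weight`, `value_of_computation`, and `cascade` are hypothetical, and the stub models and numbers are invented for illustration. The `break_rate` term is one simple way to model an imperfect higher tier that can overturn a correct answer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    """One level of the cascade (all fields are illustrative assumptions)."""
    name: str
    cost: float        # relative compute cost of calling this tier
    fix_rate: float    # assumed P(this tier corrects a wrong answer from below)
    break_rate: float  # assumed P(this tier overturns a correct answer from below)
    answer: Callable[[str], Tuple[str, float]]  # returns (answer, estimated P(correct))


def value_of_computation(p_correct: float, nxt: Tier, cost_weight: float) -> float:
    """Expected net benefit of escalating to `nxt` under the assumptions above."""
    expected_gain = (1.0 - p_correct) * nxt.fix_rate - p_correct * nxt.break_rate
    return expected_gain - cost_weight * nxt.cost


def cascade(query: str, tiers: List[Tier], cost_weight: float) -> Tuple[str, str, float]:
    """Run tiers from cheapest to most expensive, stopping when escalation stops paying off."""
    answer, p_correct = tiers[0].answer(query)
    used, spent = tiers[0].name, tiers[0].cost
    for nxt in tiers[1:]:
        if value_of_computation(p_correct, nxt, cost_weight) <= 0.0:
            break  # the current answer is good enough for its price
        answer, p_correct = nxt.answer(query)
        used, spent = nxt.name, spent + nxt.cost
    return answer, used, spent


# Stub tiers standing in for real models; the confidences are made up.
small = Tier("small", 1.0, 0.0, 0.0, lambda q: ("4", 0.9 if q == "easy" else 0.3))
large = Tier("large", 5.0, 0.8, 0.05, lambda q: ("42", 0.95))

easy_result = cascade("easy", [small, large], cost_weight=0.05)
hard_result = cascade("hard", [small, large], cost_weight=0.05)
```

With these stub values, the confident "easy" query stays on the small tier, while the low-confidence "hard" query is escalated and pays for both tiers. Raising `cost_weight` makes the cascade more reluctant to escalate, which is the cost and accuracy trade-off the abstract describes.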
