Agreement-Based Cascading for Efficient Inference
Steven Kolawole, Don Dennis, Ameet Talwalkar, and Virginia Smith

TL;DR
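Agreement-Based Cascading (ABC) is an adaptive inference method that routes each input through a cascade of increasingly large models, using agreement among an ensemble of models at each level to decide whether to answer or defer. This reduces inference cost while maintaining accuracy in edge-to-cloud, cloud serving, and LLM serving deployments.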
Contribution
This paper introduces ABC, an adaptive inference technique that uses agreement among ensembles of models as the routing signal in a cascade, and reports that it outperforms existing cascading methods in both cost and accuracy.
Findings
ABC reduces communication costs by up to 14x in edge-to-cloud inference.
ABC achieves a 3x reduction in rental costs for cloud-based model serving.
ABC reduces average price per token/request by 2-25x compared to state-of-the-art LLM cascades.
Abstract
Adaptive inference schemes reduce the cost of machine learning inference by assigning smaller models to easier examples, attempting to avoid invocation of larger models when possible. In this work we explore a simple, effective adaptive inference technique we term Agreement-Based Cascading (ABC). ABC builds a cascade of models of increasing size/complexity, and uses agreement between ensembles of models at each level of the cascade as a basis for data-dependent routing. Although ensemble execution introduces additional expense, we show that these costs can be easily offset in practice due to large expected differences in model sizes, parallel inference execution capabilities, and accuracy benefits of ensembling. We examine ABC theoretically and empirically in terms of these parameters, showing that the approach can reliably act as a drop-in replacement for existing models and surpass…
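The routing rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `majority_vote` and `abc_predict`, the use of a vote-fraction threshold as the agreement test, and the toy threshold models are all assumptions made for illustration. Each level holds an ensemble; if the fraction of models backing the majority label meets that level's threshold, the cascade answers there, otherwise the input is deferred to the next, larger level.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence, Tuple

# Hypothetical model type: any callable mapping an input to a hashable label.
Model = Callable[[float], Hashable]


def majority_vote(preds: Sequence[Hashable]) -> Tuple[Hashable, float]:
    """Return the most common prediction and the fraction of votes it received."""
    label, count = Counter(preds).most_common(1)[0]
    return label, count / len(preds)


def abc_predict(
    x: float,
    levels: Sequence[Sequence[Model]],
    thresholds: Sequence[float],
) -> Tuple[Hashable, int]:
    """Route x through the cascade.

    levels     -- ensembles ordered from smallest to largest models
    thresholds -- one agreement threshold per non-final level (assumed rule)
    Returns (label, index of the level that produced the answer).
    """
    if not levels or any(len(ensemble) == 0 for ensemble in levels):
        raise ValueError("every cascade level needs at least one model")
    if len(thresholds) != len(levels) - 1:
        raise ValueError("need exactly one threshold per non-final level")

    last = len(levels) - 1
    for depth, ensemble in enumerate(levels):
        # Run the ensemble (sequentially here; members are independent,
        # so they could also be evaluated in parallel).
        preds = [model(x) for model in ensemble]
        label, agreement = majority_vote(preds)
        # Answer at this level if the ensemble agrees enough,
        # or if there is no larger level left to defer to.
        if depth == last or agreement >= thresholds[depth]:
            return label, depth
    raise AssertionError("unreachable: the final level always answers")


# Toy stand-ins for real classifiers: three cheap models and one expensive one.
small_ensemble = [lambda x: x > 0.0, lambda x: x > 1.0, lambda x: x > -1.0]
large_ensemble = [lambda x: x > 0.5]
levels = [small_ensemble, large_ensemble]
thresholds = [1.0]  # require unanimous agreement at the first level

for x in (5.0, 0.7, -3.0):
    label, depth = abc_predict(x, levels, thresholds)
    print(f"x={x}: label={label}, answered at level {depth}")
```

In this toy setup, inputs far from the decision boundaries are answered by the small ensemble, and only the input on which the small models disagree reaches the large model. That is the source of the savings the abstract describes: the large model's cost is paid only on the fraction of inputs that are deferred.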
Taxonomy
Topics: AI in cancer detection · Neural Networks and Applications · Generative Adversarial Networks and Image Synthesis
Methods: Approximate Bayesian Computation
