Cost-Saving LLM Cascades with Early Abstention
Michael J. Zellinger, Rex Liu, Matt Thomson

TL;DR
This paper explores early abstention in LLM cascades, allowing smaller models to abstain before expensive models are invoked, which reduces costs and errors while maintaining performance in risk-sensitive domains.
Contribution
It introduces and empirically evaluates the concept of early abstention in LLM cascades, demonstrating its benefits in cost reduction and error minimization across multiple benchmarks.
Findings
Early abstention reduces overall test loss by 2.2%.
It decreases costs by 13.0% and error rates by 5.0%.
It enables a more effective use of abstention by exploiting correlations between the error patterns of small and large models.
Abstract
LLM cascades deploy small LLMs to answer most queries, limiting the use of large and expensive LLMs to difficult queries. This approach can significantly reduce costs without impacting performance. However, risk-sensitive domains such as finance or medicine place an additional premium on avoiding model errors. Since even the most expensive models are susceptible to making mistakes, applications in these domains benefit from allowing LLM systems to completely abstain from answering difficult queries. Introducing abstention poses a design question for LLM cascades: should abstention only be allowed at the final model or also at earlier models? Since the error patterns of small and large models are correlated, allowing earlier models to abstain may reduce inference costs and latency by anticipating abstention decisions by expensive and slow models, thus avoiding the need to run these…
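The design question in the abstract can be made concrete with a minimal sketch of a two-model cascade in which the small model may answer, defer to the large model, or abstain early. Everything here is an illustrative assumption rather than the authors' method: the `run_cascade` function, the confidence-threshold decision rule, the threshold values, and the per-call costs are hypothetical, and each model is assumed to return an answer together with a confidence score in [0, 1].

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical model interface: maps a query to (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


@dataclass
class CascadeResult:
    answer: Optional[str]  # None means the cascade abstained
    cost: float            # total cost of the models that were run
    stage: str             # which model made the final decision


def run_cascade(
    query: str,
    small: Model,
    large: Model,
    *,
    small_cost: float = 1.0,     # assumed per-call costs, arbitrary units
    large_cost: float = 10.0,
    accept_small: float = 0.8,   # small model answers at or above this confidence
    abstain_small: float = 0.3,  # small model abstains early below this confidence
    abstain_large: float = 0.5,  # large model abstains below this confidence
) -> CascadeResult:
    answer, conf = small(query)
    cost = small_cost

    if conf >= accept_small:
        # Easy query: the small model answers and the large model is never run.
        return CascadeResult(answer, cost, "small")

    if conf < abstain_small:
        # Early abstention: anticipate that the large model would also
        # abstain or fail, and skip its cost entirely.
        return CascadeResult(None, cost, "small")

    # Uncertain query: defer to the large model.
    answer, conf = large(query)
    cost += large_cost

    if conf < abstain_large:
        # Abstention at the final model.
        return CascadeResult(None, cost, "large")
    return CascadeResult(answer, cost, "large")
```

Setting `abstain_small=0.0` disables early abstention in this sketch, so abstention can happen only at the final model. Comparing the two settings on the same queries shows the trade-off the abstract describes: early abstention saves the large model's cost whenever the small model abstains, at the price of possibly abstaining on queries the large model could have answered.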
Taxonomy
Topics: Reservoir Engineering and Simulation Methods
