Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection
Guillem Ramírez, Alexandra Birch, Ivan Titov

TL;DR
This paper introduces a simple uncertainty-based two-tier selection method for optimizing calls to large language models (LLMs). It balances cost and performance without requiring an extra neural model, and it outperforms existing strategies in 25 of 27 experimental setups.
Contribution
The paper proposes a novel, simpler approach that uses only the uncertainty of the small LLM's generations as the decision criterion, removing the need for the additional neural model that existing LLM call optimization methods typically rely on.
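The decision rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the uncertainty score (negative mean token log-probability), the `Generation` type, the budget-based threshold calibration, and all function names are hypothetical choices, and the paper's exact uncertainty measure and calibration procedure may differ.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Generation:
    """Output of a (hypothetical) LLM call: text plus per-token log-probabilities."""
    text: str
    token_logprobs: List[float]


def uncertainty(gen: Generation) -> float:
    """Assumed uncertainty score: negative mean token log-probability.

    Higher values mean the small model was less confident in its generation.
    """
    if not gen.token_logprobs:
        raise ValueError("generation has no tokens")
    return -sum(gen.token_logprobs) / len(gen.token_logprobs)


def calibrate_threshold(scores: Sequence[float], large_fraction: float) -> float:
    """Choose a threshold on held-out uncertainty scores so that roughly
    `large_fraction` of queries (the most uncertain ones) go to the large LLM."""
    if not 0.0 <= large_fraction <= 1.0:
        raise ValueError("large_fraction must be in [0, 1]")
    ranked = sorted(scores, reverse=True)
    k = round(large_fraction * len(ranked))
    if k == 0:
        return math.inf  # no finite score reaches this, so nothing is escalated
    return ranked[k - 1]  # the k-th highest score; ties may escalate a few more


LLM = Callable[[str], Generation]


def two_tier_answer(query: str, small_llm: LLM, large_llm: LLM,
                    threshold: float) -> Tuple[str, str]:
    """Call the small LLM first; escalate to the large LLM only if the small
    model's generation is at least as uncertain as `threshold`."""
    gen = small_llm(query)
    if uncertainty(gen) >= threshold:
        return large_llm(query).text, "large"
    return gen.text, "small"


# Toy stand-ins for real model calls (purely illustrative).
def small_stub(query: str) -> Generation:
    logprobs = [-0.05, -0.1] if query == "easy" else [-1.2, -2.0]
    return Generation(text="small:" + query, token_logprobs=logprobs)


def large_stub(query: str) -> Generation:
    return Generation(text="large:" + query, token_logprobs=[-0.01])


if __name__ == "__main__":
    held_out = [0.1, 0.9, 0.4, 1.5, 0.2]
    t = calibrate_threshold(held_out, large_fraction=0.4)  # escalate top 40%
    for q in ("easy", "hard"):
        print(q, two_tier_answer(q, small_stub, large_stub, t))
```

Calibrating the threshold on held-out scores ties the decision to a budget (the fraction of calls allowed to reach the large model), which is one plausible way to trade cost for performance. Swapping in a different uncertainty measure only requires changing `uncertainty`.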
Findings
Outperforms existing methods in 25 out of 27 setups
Balances cost and performance effectively
Works across multiple LLM pairs and tasks
Abstract
Researchers and practitioners operating on a limited budget face the cost-performance trade-off dilemma. The challenging decision often centers on whether to use a large LLM with better performance or a smaller one with reduced costs. This has motivated recent research in the optimisation of LLM calls. Either a cascading strategy is used, where a smaller LLM or both are called sequentially, or a routing strategy is used, where only one model is ever called. Both scenarios are dependent on a decision criterion which is typically implemented by an extra neural model. In this work, we propose a simpler solution; we use only the uncertainty of the generations of the small LLM as the decision criterion. We compare our approach with both cascading and routing strategies using three different pairs of pre-trained small and large LLMs, on nine different tasks and against approaches that require…
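The difference between the two strategies named in the abstract can be made concrete with a simple expected-cost calculation. Assume (a simplification for illustration, not the paper's cost model) a fixed per-query cost c_s for the small LLM and c_l for the large LLM, and let p be the fraction of queries that end up at the large model. A cascade always calls the small model and then the large one for that fraction; a router calls exactly one model per query:

```latex
\begin{aligned}
C_{\text{cascade}} &= c_s + p\,c_l \\
C_{\text{route}}   &= (1-p)\,c_s + p\,c_l \\
C_{\text{cascade}} - C_{\text{route}} &= p\,c_s
\end{aligned}
```

For example, with c_s = 1, c_l = 10 and p = 0.2, the cascade costs 1 + 2 = 3 units per query and the router 0.8 + 2 = 2.8. Because an uncertainty-based criterion needs the small LLM's generation before it can decide, it pays this p·c_s overhead on escalated queries, in exchange for not training or calling an extra decision model.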
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques · Topic Modeling
