Efficiently Deploying LLMs with Controlled Risk
Michael J. Zellinger, Matt Thomson

TL;DR
This paper introduces HCMA, a hierarchical framework for deploying large language models efficiently under controlled risk. It uses model uncertainty to decide whether to answer a query, delegate it to a larger model, or abstain, and it calibrates that uncertainty with a simple, data-efficient procedure.
Contribution
The paper presents HCMA, a novel, training-free method for risk-aware LLM deployment that leverages model uncertainty and simple calibration to improve efficiency and safety.
Findings
HCMA cuts the error rate of Llama3 405B on MMLU by 30% when the model is allowed to abstain on 20% of the queries.
Calibration with logistic regressions achieves low calibration error with minimal labeled data.
Zero-shot prompting eliminates errors on TruthfulQA at high abstention rates.
Abstract
Deploying large language models in production requires simultaneous attention to efficiency and risk control. Prior work has shown the possibility to cut costs while maintaining similar accuracy, but has neglected to focus on risk control. By contrast, here we present hierarchical chains with multi-level abstention (HCMA), which use model-intrinsic uncertainty to delegate queries along the LLM intelligence hierarchy, enabling training-free model switching based solely on black-box API calls. Our framework presents novel trade-offs between efficiency and risk. For example, deploying HCMA on MMLU cuts the error rate of Llama3 405B by 30% when the model is allowed to abstain on 20% of the queries. To calibrate HCMA for optimal performance, our approach uses data-efficient logistic regressions (based on a simple nonlinear feature transformation), which require only 50 or 100 labeled…
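The calibration step mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which nonlinear transformation is used, so `transform` (here `log(1 / (1 - p))`) is an assumption, and the raw probabilities and correctness labels are made-up toy values. The logistic regression is fitted with plain gradient descent to keep the example dependency-free.

```python
import math

def transform(p, eps=1e-6):
    # ASSUMED nonlinear feature: spreads out raw probabilities clustered near 1.
    p = min(max(p, 0.0), 1.0 - eps)
    return math.log(1.0 / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(correct | x) = sigmoid(a * x + b) by gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(a * x + b) - y
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrated_prob(p_raw, a, b):
    """Map a raw model probability to a calibrated probability of correctness."""
    return sigmoid(a * transform(p_raw) + b)

# Toy labeled set (hypothetical): raw confidence and whether the answer was correct.
raw_probs = [0.55, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99, 0.999]
correct = [0, 0, 1, 0, 1, 1, 1, 1]

a, b = fit_logistic([transform(p) for p in raw_probs], correct)
```

Because only two parameters are fitted, a procedure of this shape can plausibly work with a small labeled set, which is consistent with the data-efficiency claim in the abstract.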
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The proposed HCMA method operates independently of model weights, which allows it to function within API-based LLM query setups.
### Unclear Motivation
- The motivation for a method that addresses both efficiency and risk control in LLM deployment simultaneously is not clearly explained. It is unclear why existing methods addressing efficiency or risk control separately are insufficient.
- The rationale behind the HCMA approach requires clarification.
- The paper would benefit from a stronger scientific argument that demonstrates a common challenge in efficiency and risk control in LLM deployment, justifying the simultane
1. The related work is covered in great detail.
2. This paper tries to reduce the cost by utilizing smaller LLMs if they answer the given query correctly rather than always using larger LLMs for each query. They delegate the more difficult queries to larger LLMs or abstain from answering the query altogether if they are not confident enough.
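The delegate-or-abstain behavior described in this review can be sketched as a simple routing loop. This is one plausible reading, not the paper's code: `Level`, `route`, the threshold names, and the stub models are all hypothetical, and each `ask` callable stands in for a black-box API call that returns an answer together with a calibrated confidence.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Level:
    name: str
    # Hypothetical wrapper around an API call: query -> (answer, confidence).
    ask: Callable[[str], Tuple[str, float]]
    reject_below: float  # abstain if confidence falls below this
    accept_above: float  # answer if confidence reaches this

def route(query: str, chain: List[Level]) -> Tuple[Optional[str], str]:
    """Return (answer, deciding level); answer is None when the chain abstains."""
    for i, level in enumerate(chain):
        answer, conf = level.ask(query)
        is_last = i == len(chain) - 1
        if conf < level.reject_below:
            return None, level.name      # abstain at this level
        if conf >= level.accept_above or is_last:
            return answer, level.name    # confident enough to answer
        # Otherwise the query is delegated to the next, larger model.
    return None, "empty-chain"

# Stub models with made-up answers and confidences, for illustration only.
small_table = {"easy": ("B", 0.95), "hard": ("A", 0.60),
               "trap": ("C", 0.10), "murky": ("A", 0.50)}
large_table = {"hard": ("D", 0.85), "murky": ("A", 0.30)}

chain = [
    Level("small", lambda q: small_table[q], reject_below=0.2, accept_above=0.9),
    Level("large", lambda q: large_table[q], reject_below=0.5, accept_above=0.5),
]
```

With these stubs, `route("easy", chain)` is answered by the small model, `route("hard", chain)` is delegated and answered by the large model, and `route("trap", chain)` and `route("murky", chain)` end in abstention at the small and large level respectively.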
While the problem they are tackling is quite relevant, the paper lacks sufficient experiments and baselines to demonstrate the efficacy of the proposed method. I have listed a few of my concerns below.
1. How does the modified Platt scaling work in comparison to other uncertainty quantification and probability calibration techniques such as semantic entropy (Kuhn et al.), P_true (Kadavath et al., 2022), Eigen values, Degree, Eccentricity (Lin et al. 2024) and other works listed in the uncertain
* The topic of using model cascades to cut costs is of practical relevance.
* The paper uses cross-validation across 100-500 seeds.
* The font size is reduced and the margins are made smaller. This is a potential breach of the ICLR guidelines.
* Conceptually, I cannot follow why it is required to recalibrate the LLM token / P(True) probabilities via a logistic regression with a nonlinear transformation of probabilities. All that the method uses in the end are the two threshold values. Since all transformations are monotonic, the thresholds could have also been computed for original probabilities.
* There is no comparison a
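The reviewer's monotonicity argument can be checked numerically. The sketch below assumes a strictly increasing transform and a logistic calibration with positive slope; the specific functions and the parameter values `a=1.3`, `b=-1.0` are hypothetical, not taken from the paper. Under those assumptions, thresholding the calibrated probability gives the same accept decisions as thresholding the raw probability at the inverse-mapped value.

```python
import math

def transform(p):
    # Hypothetical strictly increasing transform of a raw probability in [0, 1).
    return math.log(1.0 / (1.0 - p))

def calibrate(p, a=1.3, b=-1.0):
    # Hypothetical logistic calibration; with a > 0 it is strictly increasing in p.
    return 1.0 / (1.0 + math.exp(-(a * transform(p) + b)))

def raw_threshold(t_cal, a=1.3, b=-1.0):
    """Invert calibrate: the raw probability whose calibrated value equals t_cal."""
    z = (math.log(t_cal / (1.0 - t_cal)) - b) / a  # required value of transform(p)
    return 1.0 - math.exp(-z)

T_CAL = 0.8                   # threshold on the calibrated scale
T_RAW = raw_threshold(T_CAL)  # equivalent threshold on the raw scale

grid = [i / 20 for i in range(1, 20)]  # raw probabilities 0.05, 0.10, ..., 0.95
same_decisions = all((calibrate(p) >= T_CAL) == (p >= T_RAW) for p in grid)
```

The check covers only the decision rule. It says nothing about whether a calibrated scale makes thresholds easier to choose or to interpret.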
Taxonomy
Topics: Modular Robots and Swarm Intelligence · Optimization and Search Problems · Scheduling and Optimization Algorithms
Methods: Softmax · Attention Is All You Need · Focus
