Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing
Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, Ahmed Hassan Awadallah

TL;DR
This paper introduces a hybrid inference method that routes each query to either a small or a large language model, making up to 40% fewer calls to the large model with no drop in response quality.
Contribution
It presents a dynamic routing approach based on query difficulty and quality needs, enabling cost-efficient deployment of LLMs with maintained response quality.
Findings
Up to 40% reduction in large model calls
Maintains response quality despite cost savings
Dynamic quality-cost trade-off at test time
Abstract
Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size, while smaller models that can be deployed on lower cost (e.g., edge) devices, tend to lag behind in terms of response quality. Therefore in this work we propose a hybrid inference approach which combines their respective strengths to save cost and maintain quality. Our approach uses a router that assigns queries to the small or large model based on the predicted query difficulty and the desired quality level. The desired quality level can be tuned dynamically at test time to seamlessly trade quality for cost as per the scenario requirements. In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
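The routing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `HybridRouter`, `toy_score`, and `fraction_routed_to_small` are hypothetical names, and the word-count heuristic is only a stand-in for a learned router that would predict query difficulty. The one element taken from the abstract is the structure: a per-query score compared against a quality threshold that can be changed at test time.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HybridRouter:
    # Hypothetical scorer: returns a value in [0, 1] estimating how likely the
    # small model is to answer this query about as well as the large model.
    score_fn: Callable[[str], float]
    # Quality knob (assumption): a higher threshold sends more queries to the
    # large model, trading cost for quality.
    threshold: float = 0.5

    def route(self, query: str) -> str:
        """Return "small" or "large" for a single query."""
        return "small" if self.score_fn(query) >= self.threshold else "large"


def fraction_routed_to_small(router: HybridRouter, queries: List[str]) -> float:
    """Share of queries that avoid the large model under the current threshold."""
    decisions = [router.route(q) for q in queries]
    return decisions.count("small") / len(decisions)


def toy_score(query: str) -> float:
    # Toy stand-in for a trained router: treats shorter queries as easier.
    return max(0.0, 1.0 - len(query.split()) / 20.0)


if __name__ == "__main__":
    queries = [
        "What is the capital of France?",
        "Summarize the following paragraph in one sentence.",
        "Compare three sorting algorithms and explain their average and worst case "
        "complexity with examples for each one.",
    ]
    # Sweeping the threshold at test time changes the quality-cost trade-off
    # without retraining the scorer.
    for t in (0.2, 0.5, 0.8):
        router = HybridRouter(toy_score, threshold=t)
        print(t, fraction_routed_to_small(router, queries))
```

In a real deployment the scorer would be a trained model and the threshold would be chosen on held-out data; both are left as assumptions here.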
Peer Reviews
Decision: ICLR 2024 poster
- Paper is well written, and the authors do a good job of building upon the concepts used in the final technique.
- Ablations and analysis are extensive and well thought out, giving researchers ample inspiration to build upon this technique.
- The analysis of performance on different model size pairs is interesting to me.
Please cite these works:
- https://arxiv.org/abs/2305.05176 - routing on a query level
- https://arxiv.org/abs/2211.17192, https://arxiv.org/abs/2302.07863 - latency reduction using small and big models
I believe writing a discussion of the tradeoffs of these approaches would improve the current draft.
The paper presents a novel hybrid inference strategy designed to minimize computational expense by limiting the number of queries sent to the larger model, with smaller models functioning as decision-making routers. It also presents several approaches to training these routers and reports their effectiveness.
I have the following major concerns. 1. **Reliability of BART scores for routing** I am uncertain about the efficiency of training the router model to decide whether the BART scores of the smaller model are similar to those of the larger one. BARTScore has demonstrated strong performance in extractive QA; however, its correlation may diminish in abstractive QA contexts [1], suggesting that the metric might not be suitable for assessing open-ended generation tasks. Establishing a correlation between
The paper sets the problem in the context of LLM inference and focuses on the evaluation of response quality and cost advantage. It defines metrics for measuring the effectiveness of the routing strategy, considering the intrinsic uncertainties in natural language processing tasks. The evaluation is conducted on the MixInstruct dataset, which comprises a diverse range of tasks such as question answering, summarization, and information extraction. The experimental results demonstrate the efficacy
The main limitation of the paper seems to be its reliance on assumptions about the quality gaps and the routing mechanisms. These assumptions could affect the overall effectiveness and efficiency of the routing process. Additionally, the reliance on specific models and the need for manual intervention in setting the routing threshold may limit the scalability and generalizability of the proposed framework.
Taxonomy
Topics: Caching and Content Delivery · Software-Defined Networks and 5G · Energy Efficient Wireless Sensor Networks
