Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance
Adarsh MS, Jithin VG, Ditto PS

TL;DR
This paper introduces a reward-based hybrid inference method that selectively involves cloud LLMs during token generation, reducing costs while maintaining high response quality.
Contribution
It proposes a dynamic, reward-driven mechanism for hybrid inference that minimizes cloud LLM usage without sacrificing performance.
Findings
Significantly reduces cloud LLM traffic
Maintains high response quality with fewer cloud calls
Offers flexible control over inference cost and quality
Abstract
Large language models (LLMs) are known for their exceptional performance across a range of natural language processing tasks, but their deployment comes at a high computational and financial cost. On the other hand, smaller language models (SLMs), which can be deployed on lower-cost edge devices, struggle to match the performance of their larger counterparts. This paper presents a novel hybrid inference approach that leverages the strengths of both model types while minimizing reliance on costly cloud-based LLMs. Unlike existing methods that route entire queries to either an SLM or a cloud LLM, our approach introduces a reward-based mechanism to dynamically determine the involvement of the cloud LLM during token generation. Specifically, each token predicted by the SLM is evaluated against a reward score, and only when this score falls below a certain threshold is the cloud LLM…
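The per-token decision described in the abstract can be illustrated with a minimal sketch. The abstract is cut off before it states what happens when the score falls below the threshold, so the fallback here (asking the cloud LLM for a replacement token at that position) is an assumption, not the authors' confirmed procedure. Every name in the block (`hybrid_generate`, `slm_next`, `cloud_next`, `reward`, `threshold`, and the toy models) is hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces; the paper's actual APIs are not given in this excerpt.
TokenFn = Callable[[List[str]], str]          # context -> next token
RewardFn = Callable[[List[str], str], float]  # (context, candidate token) -> score


def hybrid_generate(
    prompt: List[str],
    slm_next: TokenFn,
    cloud_next: TokenFn,
    reward: RewardFn,
    threshold: float,
    max_new_tokens: int = 16,
    eos: str = "<eos>",
) -> Tuple[List[str], int]:
    """Return the generated tokens and the number of cloud calls made."""
    context = list(prompt)
    generated: List[str] = []
    cloud_calls = 0
    for _ in range(max_new_tokens):
        # The small on-device model always proposes the next token first.
        token = slm_next(context)
        # Assumption: a score below the threshold triggers a cloud call
        # for this single token only, not for the whole query.
        if reward(context, token) < threshold:
            token = cloud_next(context)
            cloud_calls += 1
        if token == eos:
            break
        generated.append(token)
        context.append(token)
    return generated, cloud_calls


# Toy stand-ins so the sketch runs without any real model.
def toy_slm(context: List[str]) -> str:
    # Produces a deliberately poor token on every odd-length context.
    return "the" if len(context) % 2 == 0 else "???"


def toy_cloud(context: List[str]) -> str:
    return "cat"


def toy_reward(context: List[str], token: str) -> float:
    return 0.1 if token == "???" else 0.9


if __name__ == "__main__":
    for threshold in (0.0, 0.5, 1.0):
        tokens, calls = hybrid_generate(
            ["a"], toy_slm, toy_cloud, toy_reward,
            threshold=threshold, max_new_tokens=4,
        )
        print(threshold, tokens, calls)
```

In this toy setup, raising `threshold` sends more tokens to the cloud model and lowering it keeps more generation on the small model, which is one way to read the cost and quality control listed under Findings.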
Taxonomy
Topics: Cryptography and Data Security · Access Control and Trust · Digital Rights Management and Security
