GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment
Yao Yao, Zuchao Li, Hai Zhao

TL;DR
This paper introduces GKT, a guidance-based knowledge transfer framework in which a larger LLM produces guidance prompts and a smaller model completes the response. It requires no fine-tuning and targets cost-effective cloud-edge deployment, with reported gains in both speed and accuracy.
Contribution
GKT is a novel, fine-tuning-free framework that uses a larger LLM as a guide to improve smaller models' responses, facilitating efficient and customizable cloud-edge LLM deployment.
Findings
Achieves up to 14.18% accuracy improvement and 10.72x speed-up on GSM8K.
Attains 95% of ChatGPT's performance at 52% of the cost using GKT.
Surpasses individual model performance in accuracy and speed on benchmark datasets.
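As a rough reading of the cost finding above, the two reported ratios can be combined into a performance-per-cost figure. The snippet below is only arithmetic on the two numbers quoted in the findings; the variable names are illustrative and the derived ratio is not a figure reported by the paper.

```python
# Ratios quoted in the findings, both relative to ChatGPT (= 1.0).
relative_performance = 0.95  # GKT reaches 95% of ChatGPT's performance
relative_cost = 0.52         # ... at 52% of ChatGPT's cost

# Derived quantities (illustrative, not reported by the paper).
performance_per_cost = relative_performance / relative_cost
cost_saving = 1.0 - relative_cost

print(f"performance per unit cost vs. ChatGPT: {performance_per_cost:.2f}x")
print(f"cost saving vs. ChatGPT: {cost_saving:.0%}")
```

Under these numbers, the trade is about a 5% drop in performance for a 48% reduction in cost.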
Abstract
The burgeoning size of Large Language Models (LLMs) has led to enhanced capabilities in generating responses, albeit at the expense of increased inference times and elevated resource demands. Existing methods of acceleration, predominantly hinged on knowledge distillation, generally necessitate fine-tuning of considerably large models, such as Llama-7B, posing a challenge for average users. Furthermore, present techniques for expediting inference and reducing costs operate independently. To address these issues, we introduce a novel and intuitive Guidance-based Knowledge Transfer (GKT) framework. This approach leverages a larger LLM as a "teacher" to create guidance prompts, paired with a smaller "student" model to finalize responses. Remarkably, GKT requires no fine-tuning and doesn't necessitate the teacher and student models to have the same vocabulary, allowing for extensive…
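The teacher/student flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Generate` callable type, the `gkt_answer` function, the toy models, and the choice to treat the guidance as a short teacher-generated prefix that is passed to the student as plain text are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical interface: (prompt, max_new_tokens) -> generated text.
Generate = Callable[[str, int], str]


def gkt_answer(question: str, teacher: Generate, student: Generate,
               guidance_tokens: int = 40, answer_tokens: int = 300) -> str:
    """Teacher writes a short guidance prompt; student finishes the response."""
    # Step 1: the larger (cloud-side) model generates only a short guidance text.
    guidance = teacher(question, guidance_tokens)
    # Step 2: guidance is handed over as plain text, so the two models do not
    # need to share a vocabulary or tokenizer.
    student_prompt = f"{question}\n{guidance}"
    # Step 3: the smaller (edge-side) model completes the response.
    completion = student(student_prompt, answer_tokens)
    return guidance + completion


# Toy stand-ins for real models, used only to make the sketch runnable.
def toy_teacher(prompt: str, max_new_tokens: int) -> str:
    words = "First add the two numbers , then double the result .".split()
    return " ".join(words[:max_new_tokens])


def toy_student(prompt: str, max_new_tokens: int) -> str:
    return " So the answer is 10."


print(gkt_answer("What is (2+3)*2?", toy_teacher, toy_student, guidance_tokens=5))
```

Because the teacher only emits `guidance_tokens` tokens, most of the generation work falls on the smaller model, which is where the speed and cost savings would come from in such a setup.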
Taxonomy
Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Service-Oriented Architecture and Web Services
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
