Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding
Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi

TL;DR
This paper introduces a hybrid LLM-SLM approach that leverages a large frozen LLM to guide a smaller language model, significantly reducing decoding time while maintaining high task performance.
Contribution
It presents a novel method combining large frozen LLMs with small language models, requiring only fine-tuning of the SLM for faster autoregressive decoding.
Findings
Achieves up to 4x speedup in decoding compared to the LLM alone.
Retains 98-99% of the LLM's performance on translation and summarization.
Applicable to various model architectures and tasks.
Abstract
Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following. However, their enormous size and reliance on autoregressive decoding increase deployment costs and complicate their use in latency-critical applications. In this work, we propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding while maintaining high performance. Our method utilizes a pretrained frozen LLM that encodes all prompt tokens once in parallel, and uses the resulting representations to condition and guide a small language model (SLM), which then generates the response more efficiently. We investigate the combination of encoder-decoder LLMs with both encoder-decoder and decoder-only SLMs from different model families and only require…
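The decoding scheme described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `FrozenToyLLM`, `ToySLM`, `project`, the mean-pooling step, and all dimensions are hypothetical stand-ins, and the "models" are fixed trigonometric functions rather than trained networks. The only point it shows is the cost structure: the large model runs once over the whole prompt, and the small model runs once per generated token.

```python
import math

# All sizes below are made up for illustration.
VOCAB_SIZE = 16
LLM_DIM = 8
SLM_DIM = 4


class FrozenToyLLM:
    """Stand-in for a large frozen model; it only encodes the prompt."""

    def __init__(self):
        self.forward_passes = 0

    def encode(self, prompt_ids):
        # One pass over all prompt tokens at once: one vector per token.
        self.forward_passes += 1
        return [
            [math.sin((tok + 1) * (j + 1)) for j in range(LLM_DIM)]
            for tok in prompt_ids
        ]


def project(llm_states):
    """Hypothetical projector: mean-pool, then a fixed linear map to SLM_DIM."""
    n = len(llm_states)
    pooled = [sum(s[j] for s in llm_states) / n for j in range(LLM_DIM)]
    return [
        sum(pooled[j] * math.cos(i + j) for j in range(LLM_DIM))
        for i in range(SLM_DIM)
    ]


class ToySLM:
    """Stand-in for a small model that decodes one token per call."""

    def __init__(self):
        self.forward_passes = 0

    def next_token(self, condition, generated):
        self.forward_passes += 1
        prev = generated[-1] if generated else 0
        # Score each vocabulary entry from the conditioning vector and
        # the previously generated token, then pick the best one (greedy).
        scores = [
            sum(
                condition[i] * math.sin((v + 1) * (i + 1) + prev)
                for i in range(SLM_DIM)
            )
            for v in range(VOCAB_SIZE)
        ]
        return max(range(VOCAB_SIZE), key=scores.__getitem__)


def generate(llm, slm, prompt_ids, max_new_tokens):
    if not prompt_ids:
        raise ValueError("prompt_ids must not be empty")
    # The expensive model is called exactly once, on the full prompt.
    condition = project(llm.encode(prompt_ids))
    output = []
    # The cheap model handles every autoregressive step.
    for _ in range(max_new_tokens):
        output.append(slm.next_token(condition, output))
    return output
```

Calling `generate(FrozenToyLLM(), ToySLM(), [3, 1, 4], 5)` returns five token ids while the LLM counter stays at one forward pass and the SLM counter reaches five. In a real system the per-token cost of the small model, not the large one, would therefore dominate long generations.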
Taxonomy
Topics: Natural Language Processing Techniques
