Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

Benjamin Bergner; Andrii Skliar; Amelie Royer; Tijmen Blankevoort; Yuki Asano; Babak Ehteshami Bejnordi

arXiv:2402.16844 · cs.LG · July 18, 2024 · 1 citation

Open Access

TL;DR

This paper introduces a hybrid LLM-SLM approach that leverages a large frozen LLM to guide a smaller language model, significantly reducing decoding time while maintaining high task performance.

Contribution

It presents a method that pairs a large frozen LLM with a small language model (SLM) for faster autoregressive decoding; only the SLM needs to be fine-tuned.

Findings

01. Achieves up to 4x faster decoding than using the LLM alone.

02. Retains 98-99% of the LLM's performance on translation and summarization.

03. Applicable to various model architectures and tasks.

Abstract

Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following. However, their enormous size and reliance on autoregressive decoding increase deployment costs and complicate their use in latency-critical applications. In this work, we propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding while maintaining high performance. Our method utilizes a pretrained frozen LLM that encodes all prompt tokens once in parallel, and uses the resulting representations to condition and guide a small language model (SLM), which then generates the response more efficiently. We investigate the combination of encoder-decoder LLMs with both encoder-decoder and decoder-only SLMs from different model families and only require…
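The control flow described in the abstract (one parallel LLM pass over the prompt, followed by token-by-token generation in the SLM) can be illustrated with a toy sketch. Everything below is a stand-in chosen for illustration, not the authors' implementation: `llm_encode`, `project`, and `slm_next_token` are hypothetical functions with made-up arithmetic, and the way the LLM representations are handed to the SLM is an assumption, since the excerpt does not specify it. The sketch only shows where each model is called and how often.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CallCounter:
    """Counts forward passes so the cost structure is visible."""
    llm_calls: int = 0
    slm_calls: int = 0


def llm_encode(prompt_ids: List[int], counter: CallCounter, dim: int = 4) -> List[List[float]]:
    """Hypothetical stand-in for the frozen LLM: one pass over all prompt tokens."""
    counter.llm_calls += 1
    # Toy "representation": a deterministic vector per prompt token.
    return [[float((t * (j + 1)) % 7) for j in range(dim)] for t in prompt_ids]


def project(reps: List[List[float]], slm_dim: int = 2) -> List[List[float]]:
    """Hypothetical projector mapping LLM features to the SLM width (chunk averaging)."""
    projected = []
    for row in reps:
        chunk = len(row) // slm_dim
        projected.append([sum(row[i * chunk:(i + 1) * chunk]) / chunk for i in range(slm_dim)])
    return projected


def slm_next_token(cond: List[List[float]], generated: List[int],
                   counter: CallCounter, vocab_size: int = 10) -> int:
    """Hypothetical stand-in for one SLM decoding step, conditioned on the LLM features."""
    counter.slm_calls += 1
    score = sum(sum(v) for v in cond) + sum(generated) + len(generated)
    return int(score) % vocab_size


def generate(prompt_ids: List[int], max_new_tokens: int, eos_id: int = 0) -> Tuple[List[int], CallCounter]:
    counter = CallCounter()
    cond = project(llm_encode(prompt_ids, counter))  # LLM runs exactly once
    out: List[int] = []
    for _ in range(max_new_tokens):                  # only the SLM runs per token
        tok = slm_next_token(cond, out, counter)
        out.append(tok)
        if tok == eos_id:
            break
    return out, counter


if __name__ == "__main__":
    tokens, calls = generate([3, 5, 8], max_new_tokens=6)
    print(tokens, calls)
```

In this sketch the expensive model is invoked once regardless of response length, while the number of cheap SLM calls equals the number of generated tokens. That is the source of the decoding savings the abstract describes, under the assumption that an SLM step costs much less than an LLM step.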

Taxonomy

Topics: Natural Language Processing Techniques