Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models

Jerry Huang; Prasanna Parthasarathi; Mehdi Rezagholizadeh; Sarath Chandar

arXiv:2408.08470·cs.LG·December 17, 2024

TL;DR

This paper proposes a context-aware method for selecting among draft models in assisted decoding of large language models, improving inference speed and domain adaptation without requiring details about how the draft models were constructed.

Contribution

It introduces a novel offline training approach using output alignment to select optimal draft models, enhancing inference efficiency across multiple domains.
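
As a rough illustration of selecting a draft model by output alignment, the sketch below scores each candidate by how often its offline outputs agree with the target's outputs on the same prompts, then picks the best-scoring draft per domain. This summary does not specify the paper's actual training procedure, so the position-wise agreement metric, the function names (`alignment_score`, `select_drafts`), and the data layout are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def alignment_score(draft_out: Sequence[int], target_out: Sequence[int]) -> float:
    """Fraction of positions where the draft's token equals the target's token.

    Assumption: simple position-wise agreement stands in for whatever
    alignment measure the paper actually uses.
    """
    n = max(len(draft_out), len(target_out))
    if n == 0:
        return 1.0
    matches = sum(d == t for d, t in zip(draft_out, target_out))
    return matches / n


def select_drafts(
    target_outputs: Dict[str, List[List[int]]],
    draft_outputs: Dict[str, Dict[str, List[List[int]]]],
) -> Dict[str, str]:
    """Pick, for each domain, the draft whose logged outputs best match the target.

    target_outputs: domain -> list of target token sequences (one per prompt).
    draft_outputs:  draft name -> domain -> list of draft token sequences,
                    in the same prompt order as target_outputs.
    """
    selection: Dict[str, str] = {}
    for domain, targets in target_outputs.items():
        best_name, best_score = "", -1.0
        for name, per_domain in draft_outputs.items():
            outs = per_domain[domain]
            score = sum(
                alignment_score(d, t) for d, t in zip(outs, targets)
            ) / len(targets)
            if score > best_score:
                best_name, best_score = name, score
        selection[domain] = best_name
    return selection


# Hypothetical logged outputs for two domains and two black-box drafts.
target_logs = {
    "code": [[1, 2, 3, 4], [5, 6, 7, 8]],
    "news": [[9, 9, 1, 2], [3, 3, 4, 4]],
}
draft_logs = {
    "draft_a": {"code": [[1, 2, 3, 4], [5, 6, 0, 8]], "news": [[0, 0, 0, 0], [3, 0, 0, 0]]},
    "draft_b": {"code": [[1, 0, 0, 0], [0, 0, 0, 0]], "news": [[9, 9, 1, 2], [3, 3, 4, 0]]},
}
chosen = select_drafts(target_logs, draft_logs)
```

Because the scoring uses only logged outputs, it treats every draft as a black box, which matches the setting described in the abstract.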

Findings

01 Offline training with output alignment accelerates inference.

02 Effective draft model selection improves domain adaptation.

03 Method is flexible with multiple assisted decoding candidates.

Abstract

Despite their widespread adoption, large language models (LLMs) remain prohibitive to use under resource constraints, with their ever-growing sizes only raising the barrier to use. One noted issue is the high latency associated with auto-regressive generation, which makes the use of large LLMs dependent on advanced computing infrastructure. Assisted decoding, where a smaller draft model guides a larger target model's generation, has helped alleviate this, but it remains dependent on alignment between the two models. Thus, if the draft model is insufficiently capable on some domain relative to the target model, performance can degrade. Alternatively, one can leverage multiple draft models to better cover the expertise of the target, but when multiple black-box draft models are available, selecting an assistant without details about its construction can be difficult. To better understand this…
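
To make the dependence on draft-target alignment concrete, here is a minimal sketch of greedy assisted decoding with toy next-token functions. It is not the paper's implementation: the `assisted_decode` function, the toy `target` and `draft` models, and the draft length `k` are illustrative assumptions, and a real system would verify all drafted tokens in one batched forward pass of the target rather than one call per position.

```python
from typing import Callable, List, Tuple

# A "model" here is just a greedy next-token function over integer tokens.
NextTokenFn = Callable[[List[int]], int]


def assisted_decode(
    target: NextTokenFn,
    draft: NextTokenFn,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[int], float]:
    """Greedy assisted decoding; returns (tokens, draft acceptance rate)."""
    tokens = list(prompt)
    proposed = accepted = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        remaining = max_new_tokens - (len(tokens) - len(prompt))
        budget = min(k, remaining)
        # The draft proposes up to `budget` tokens autoregressively.
        drafted: List[int] = []
        for _ in range(budget):
            drafted.append(draft(tokens + drafted))
        proposed += len(drafted)
        # The target checks each drafted token in order.
        for tok in drafted:
            expected = target(tokens)
            if tok == expected:
                tokens.append(tok)
                accepted += 1
            else:
                # First mismatch: keep the target's token, discard the rest.
                tokens.append(expected)
                break
        else:
            # Every drafted token was accepted; the target adds one more
            # token if the generation budget allows it.
            if len(tokens) - len(prompt) < max_new_tokens:
                tokens.append(target(tokens))
    rate = accepted / proposed if proposed else 0.0
    return tokens, rate


# Toy models: the target counts upward modulo 10; the draft agrees
# except right after token 5, where it is "out of domain".
def target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


output, acceptance = assisted_decode(target, draft, [0], max_new_tokens=12, k=4)
```

The output always matches what the target alone would generate greedily; only the number of target verification rounds changes. A poorly aligned draft lowers the acceptance rate, so more rounds are needed and the speedup shrinks.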

Taxonomy

Topics: Topic Modeling · Data Quality and Management · Recommender Systems and Techniques