Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models
Jerry Huang, Prasanna Parthasarathi, Mehdi Rezagholizadeh, Sarath Chandar

TL;DR
This paper proposes a context-aware selection method for draft models in assisted decoding of large language models, improving inference speed and domain adaptation without prior model knowledge.
Contribution
It introduces a novel offline training approach using output alignment to select optimal draft models, enhancing inference efficiency across multiple domains.
Findings
Offline training with output alignment accelerates inference.
Effective draft model selection improves domain adaptation.
The method remains flexible when multiple assisted decoding candidates are available.
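As a rough illustration of the findings above, the sketch below picks a draft model per context from offline logs by averaging an output-alignment score. It is a simplified stand-in, not the paper's training procedure: the discrete context label, the position-wise match score, and the names `alignment` and `fit_selector` are all assumptions made for this example.

```python
from collections import defaultdict


def alignment(draft_tokens, target_tokens):
    """Fraction of target positions where the draft emitted the same token.

    Hypothetical stand-in for an output-alignment score.
    """
    if not target_tokens:
        return 0.0
    matches = sum(d == t for d, t in zip(draft_tokens, target_tokens))
    return matches / len(target_tokens)


def fit_selector(logged):
    """Build a context -> draft-name lookup from offline logs.

    `logged` is a list of (context_label, draft_name, draft_tokens,
    target_tokens) tuples collected ahead of time. For each context label,
    the draft with the highest mean alignment is selected.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (label, draft) -> [score sum, count]
    for label, name, draft_tokens, target_tokens in logged:
        entry = totals[(label, name)]
        entry[0] += alignment(draft_tokens, target_tokens)
        entry[1] += 1

    best = {}  # label -> (draft name, mean alignment)
    for (label, name), (score_sum, count) in totals.items():
        mean = score_sum / count
        if label not in best or mean > best[label][1]:
            best[label] = (name, mean)
    return {label: name for label, (name, _) in best.items()}


# Toy logs: two hypothetical drafts evaluated on two hypothetical domains.
logged = [
    ("translation", "draft_a", [1, 2, 3, 4], [1, 2, 3, 4]),
    ("translation", "draft_b", [1, 9, 9, 4], [1, 2, 3, 4]),
    ("code", "draft_a", [5, 0, 0], [5, 6, 7]),
    ("code", "draft_b", [5, 6, 7], [5, 6, 7]),
]
selector = fit_selector(logged)
```

In this toy setup each domain ends up routed to the draft whose logged outputs most closely match the target's. A learned policy over richer context features would replace the lookup table in practice.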
Abstract
Despite their widespread adoption, large language models (LLMs) remain prohibitive to use under resource constraints, with their ever-growing sizes only increasing the barrier for use. One noted issue is the high latency associated with auto-regressive generation, rendering the use of large LLMs dependent on advanced computing infrastructure. Assisted decoding, where a smaller draft model guides a larger target model's generation, has helped alleviate this, but remains dependent on alignment between the two models. Thus, if the draft model is insufficiently capable on some domain relative to the target model, performance can degrade. Alternatively, one can leverage multiple draft models to better cover the expertise of the target, but when multiple black-box draft models are available, selecting an assistant without details about its construction can be difficult. To better understand this…
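The assisted decoding loop mentioned in the abstract can be sketched in a toy greedy form. This is a minimal illustration under stated assumptions, not the paper's implementation: `target_next` and `draft_next` are hypothetical callables that map a token list to the next token, and the sequential verification here stands in for the single batched target forward pass that real systems use.

```python
def assisted_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Toy greedy assisted decoding.

    The draft proposes up to k tokens; the target accepts the longest prefix
    that matches its own greedy choices and supplies its own token at the
    first mismatch. Returns (generated tokens, number of accepted draft tokens).
    """
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    accepted = 0
    while len(tokens) < limit:
        # Draft phase: propose a short continuation autoregressively.
        proposal = []
        for _ in range(min(k, limit - len(tokens))):
            proposal.append(draft_next(tokens + proposal))

        # Verification phase: compare each proposed token with the target's choice.
        for tok in proposal:
            expected = target_next(tokens)
            if tok == expected:
                tokens.append(tok)
                accepted += 1
            else:
                tokens.append(expected)  # fall back to the target's token
                break
    return tokens[len(prompt):], accepted


# Hypothetical toy models over digit tokens.
def target(toks):
    return (toks[-1] + 1) % 10  # target counts upward


def aligned_draft(toks):
    return (toks[-1] + 1) % 10  # agrees with the target


def misaligned_draft(toks):
    return 0  # always proposes token 0


good_out, good_accepted = assisted_decode(target, aligned_draft, [0], 5)
bad_out, bad_accepted = assisted_decode(target, misaligned_draft, [0], 5)
```

Both runs produce the same output as greedy decoding with the target alone. The difference is in how many draft tokens are accepted, which is why a poorly aligned draft yields little or no speedup.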
Taxonomy
Topics: Topic Modeling · Data Quality and Management · Recommender Systems and Techniques
