KV Prediction for Improved Time to First Token

Maxwell Horton; Qingqing Cao; Chenfan Sun; Yanzi Jin; Sachin Mehta; Mohammad Rastegari; Moin Nabi

arXiv:2410.08391·cs.CL·October 14, 2024

TL;DR

This paper introduces KV Prediction, a method that uses an auxiliary model to approximate the KV cache, significantly reducing the time to first token in transformer models while maintaining accuracy.

Contribution

The paper presents a novel KV Prediction technique that improves initial inference speed for large language models without sacrificing accuracy.

Findings

01

Achieves relative accuracy improvements of 15-50% on TriviaQA across a range of TTFT FLOPs budgets.

02

Improves accuracy by up to 30% on HumanEval Python code completion at fixed TTFT FLOPs budgets.

03

Demonstrates TTFT speedup on Apple M2 Pro hardware.

Abstract

Inference with transformer-based language models begins with a prompt processing step. In this step, the model generates the first output token and stores the KV cache needed for future generation steps. This prompt processing step can be computationally expensive, taking tens of seconds or more for billion-parameter models on edge devices when prompt lengths or batch sizes rise. This degrades user experience by introducing significant latency into the model's outputs. To reduce the time spent producing the first output (known as the "time to first token", or TTFT) of a pretrained model, we introduce a novel method called KV Prediction. In our method, a small auxiliary model is used to process the prompt and produce an approximation of the KV cache used by a base model. This approximated KV cache is then used with the base model for autoregressive generation without the need to query…
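The data flow described in the abstract can be illustrated with a minimal sketch. The abstract excerpt does not specify how the auxiliary cache is turned into a base-model cache, so the per-layer linear maps below are an assumption made for illustration, and every name (`predict_base_kv`, `projections`, `layer_map`) is hypothetical. The weights are random placeholders, not trained predictors.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def predict_base_kv(aux_kv, projections, layer_map):
    """Map an auxiliary model's KV cache to an approximate base-model KV cache.

    aux_kv:      one (keys, values) pair per auxiliary layer; keys and values
                 are lists of per-token vectors of size d_aux.
    projections: one (W_k, W_v) pair per base layer (assumed linear predictors).
    layer_map:   layer_map[j] is the auxiliary layer that feeds base layer j.
    """
    base_kv = []
    for j, (W_k, W_v) in enumerate(projections):
        keys, values = aux_kv[layer_map[j]]
        base_kv.append(([matvec(W_k, k) for k in keys],
                        [matvec(W_v, v) for v in values]))
    return base_kv

def rand_mat(rows, cols):
    """Random placeholder matrix; a real system would use trained weights."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n_tokens, d_aux, d_base = 5, 4, 8      # hypothetical sizes
aux_layers, base_layers = 2, 4

# Stand-in for the cache the small auxiliary model would produce on the prompt.
aux_kv = [(rand_mat(n_tokens, d_aux), rand_mat(n_tokens, d_aux))
          for _ in range(aux_layers)]
projections = [(rand_mat(d_base, d_aux), rand_mat(d_base, d_aux))
               for _ in range(base_layers)]
layer_map = [0, 0, 1, 1]               # each aux layer feeds two base layers

base_kv = predict_base_kv(aux_kv, projections, layer_map)
print(len(base_kv), len(base_kv[0][0]), len(base_kv[0][0][0]))
```

In this sketch the only prompt-length-dependent work is the auxiliary forward pass (stubbed out here) plus one small matrix product per token and layer, which is where the TTFT saving would come from under these assumptions.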

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 4

Strengths

The idea of KV cache prediction based on an auxiliary model seems unique, though there are potential issues (as described in the weaknesses).

Weaknesses

1. The code has not been released yet, although the authors claim its release as part of their contributions. In any case, releasing code cannot be counted as a technical contribution.
2. The paper mainly focuses on token eviction works to justify the claim of TTFT increase. However, there are works on KV cache quantization, for example KIVI [1] and GEAR [2], for which this may not always be true under all settings. Additionally, for complex reasoning tasks KV quantization has shown more promise as op

Reviewer 02 · Rating 5 · Confidence 3

Strengths

Reducing the time to the first token is a crucial and challenging problem. This paper conducts extensive experiments and makes significant efforts to design KV prediction, which demonstrates effectiveness in achieving a Pareto-optimal trade-off between efficiency and accuracy.

Weaknesses

1. Based on the experimental results, the base model performance can be easily affected after applying the proposed method. Compared to other methods that can indirectly reduce time to the first token, such as quantization, is the Pareto curve of KV prediction still considered optimal?
2. Could the authors provide additional performance results on challenging reasoning tasks (e.g., GSM8K/MATH) to assess whether this method impacts the base model's performance?

Reviewer 03 · Rating 5 · Confidence 3

Strengths

- Proposes a new method for accelerating the prefilling stage of LLMs using a smaller model.
- Gives a detailed explanation of the training process.

Weaknesses

- The method is only tested on OpenELM family models, so its generalizability remains in question. This is the major concern.
- For CPU inference, shouldn't we also consider TPOT? From Table 3, there are some improvements over the auxiliary-only and base models, but they also improve TPOT. With this consideration, is a low TTFT but high TPOT worthwhile? What about the total execution time?
- For HumanEval, the advantages are not as obvious as in TriviaQA, as shown in Figure 4. This makes the generaliz
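The total-execution-time question raised above can be made concrete with a small calculation. This sketch assumes the common approximation that end-to-end latency is TTFT plus one TPOT for each output token after the first. All timings are hypothetical and are not taken from the paper's Table 3.

```python
def total_latency(ttft_s, tpot_s, n_output_tokens):
    """Approximate end-to-end latency: first token, then one TPOT per later token."""
    return ttft_s + (n_output_tokens - 1) * tpot_s

def break_even_tokens(ttft_a, tpot_a, ttft_b, tpot_b):
    """Output length at which configurations A and B have equal total latency.

    Assumes B has the lower TTFT and the higher TPOT, so B is faster for
    outputs shorter than the returned length and slower beyond it.
    """
    return (ttft_a - ttft_b) / (tpot_b - tpot_a) + 1

# Hypothetical timings in seconds.
baseline = {"ttft_s": 2.0, "tpot_s": 0.05}
low_ttft = {"ttft_s": 1.0, "tpot_s": 0.06}

for n in (16, 101, 512):
    a = total_latency(n_output_tokens=n, **baseline)
    b = total_latency(n_output_tokens=n, **low_ttft)
    print(f"{n:4d} tokens: baseline {a:.2f}s, low-TTFT {b:.2f}s")

crossover = break_even_tokens(2.0, 0.05, 1.0, 0.06)
```

With these made-up numbers the lower-TTFT configuration wins only for outputs shorter than roughly 101 tokens, which is why reporting total execution time alongside TTFT matters.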

Reviewer 04 · Rating 5 · Confidence 4

Strengths

1. The algorithm is clearly defined and well explained.
2. The efficiency-accuracy trade-off is well studied through empirical experiments.

Weaknesses

1. At line 066, the paper claims that "We release our code". However, there is no link to the code. Could the authors provide the link or clarify the code release plan?
2. Possibly incorrect estimation of FLOPs: Section 4.4, line 303 claims that "The FLOPs-per-token compute cost of transformers inference can be estimated as 2P, where P is the number of parameters in the model". However, the calculation in Kaplan et al., 2020 is $C = 2P+2n_{layer}d_{attn}N$ because $n_{ctx}=N$. The second term ca
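The gap between the two estimates quoted in this review can be checked numerically. The sketch below implements both formulas exactly as written in the review text; the model dimensions are hypothetical round numbers chosen for illustration, not the configuration of any model in the paper.

```python
def flops_per_token_simple(n_params):
    """Estimate quoted from the paper: C = 2P."""
    return 2 * n_params

def flops_per_token_with_context(n_params, n_layer, d_attn, n_ctx):
    """Estimate quoted by the reviewer: C = 2P + 2 * n_layer * d_attn * n_ctx."""
    return 2 * n_params + 2 * n_layer * d_attn * n_ctx

# Hypothetical model: 1B parameters, 24 layers, attention width 2048.
n_params, n_layer, d_attn = 1_000_000_000, 24, 2048

for n_ctx in (512, 2048, 8192):
    simple = flops_per_token_simple(n_params)
    full = flops_per_token_with_context(n_params, n_layer, d_attn, n_ctx)
    extra = (full - simple) / simple
    print(f"n_ctx={n_ctx:5d}: context term adds {extra:.1%} over 2P")
```

For this hypothetical model the context-dependent term is small at short contexts and grows linearly with context length, so whether 2P is an adequate estimate depends on the prompt lengths being evaluated.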

Code & Models

Repositories

apple/corenet
pytorch · Official

Taxonomy

Topics: Tunneling and Rock Mechanics · Image and Object Detection Techniques

Methods: Balanced Selection