Deliberation in Latent Space via Differentiable Cache Augmentation
Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam

TL;DR
This paper introduces a differentiable cache augmentation method for large language models, allowing offline enhancement of the model's key-value cache to improve reasoning and reduce perplexity without modifying the core decoder.
Contribution
The authors propose an offline, differentiable cache augmentation technique: a coprocessor adds latent embeddings to a frozen LLM's key-value cache to improve reasoning, and it is trained with the decoder's language modeling loss while the decoder's weights stay fixed.
Findings
Cache augmentation reduces perplexity on subsequent tokens.
The method improves performance on reasoning-intensive tasks.
The approach operates asynchronously without modifying the decoder.
Abstract
Techniques enabling large language models (LLMs) to "think more" by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. However, the standard approaches generate sequences of discrete tokens immediately before responding, and so they can incur significant latency costs and be challenging to optimize. In this work, we demonstrate that a frozen LLM can be augmented with an offline coprocessor that operates on the model's key-value (kv) cache. This coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding. We train this coprocessor using the language modeling loss from the decoder on standard pretraining data, while keeping the decoder itself frozen. This approach enables the model to learn, in an end-to-end differentiable fashion, how to distill additional computation into its…
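As a rough illustration of the mechanism the abstract describes, the toy sketch below appends latent (key, value) pairs to a cache and tunes only the coprocessor's parameters against the loss of a frozen decoder. Everything in it is an assumption made for illustration and not the authors' implementation: the single-head attention "decoder", the mean-plus-offset form of the coprocessor, the names `decoder_loss`, `coprocessor`, and `train`, and the use of finite-difference gradients in place of the end-to-end backpropagation the paper relies on.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Frozen toy decoder weights (hypothetical): vocabulary of 3, hidden size 2.
# Nothing below ever updates W_OUT.
W_OUT = ((1.0, 0.0), (0.0, 1.0), (-1.0, -1.0))

def decoder_loss(query, cache, target):
    """Attend over the (key, value) cache, then return cross-entropy for `target`."""
    weights = softmax([dot(query, k) for k, _ in cache])
    hidden = [sum(w * v[i] for w, (_, v) in zip(weights, cache)) for i in range(2)]
    probs = softmax([dot(row, hidden) for row in W_OUT])
    return -math.log(probs[target])

def coprocessor(cache, params):
    """Hypothetical coprocessor: emit latent (key, value) pairs built from the
    cache mean plus learned offsets. Each latent uses 4 entries of `params`."""
    n = len(cache)
    mean_k = [sum(k[i] for k, _ in cache) / n for i in range(2)]
    mean_v = [sum(v[i] for _, v in cache) / n for i in range(2)]
    latents = []
    for j in range(len(params) // 4):
        p = params[4 * j: 4 * j + 4]
        key = [mean_k[0] + p[0], mean_k[1] + p[1]]
        value = [mean_v[0] + p[2], mean_v[1] + p[3]]
        latents.append((key, value))
    return latents

def augmented_loss(params, query, cache, target):
    """Loss of the frozen decoder when it reads the augmented cache."""
    return decoder_loss(query, cache + coprocessor(cache, params), target)

def train(params, query, cache, target, steps=100, lr=0.1, eps=1e-5):
    """Gradient descent on the coprocessor parameters only.
    Central finite differences stand in for backpropagation in this sketch."""
    params = list(params)
    for _ in range(steps):
        grad = []
        for i in range(len(params)):
            up = params[:i] + [params[i] + eps] + params[i + 1:]
            down = params[:i] + [params[i] - eps] + params[i + 1:]
            diff = augmented_loss(up, query, cache, target) - augmented_loss(down, query, cache, target)
            grad.append(diff / (2 * eps))
        params = [p - lr * g for p, g in zip(params, grad)]
    return params

# Toy data (hypothetical): two cached positions, one query, one target token.
CACHE = [([1.0, 0.0], [0.5, -0.5]), ([0.0, 1.0], [-0.5, 0.5])]
QUERY = [0.5, 0.5]
TARGET = 0
N_LATENTS = 2

base_loss = decoder_loss(QUERY, CACHE, TARGET)
trained_params = train([0.0] * (4 * N_LATENTS), QUERY, CACHE, TARGET)
augmented_cache = CACHE + coprocessor(CACHE, trained_params)
new_loss = augmented_loss(trained_params, QUERY, CACHE, TARGET)

print("cache length before/after:", len(CACHE), len(augmented_cache))
print("loss before/after augmentation:", base_loss, new_loss)
```

The point of the sketch is the division of labor: the decoder function and its weights are reused unchanged, the original cache entries are left intact, and the only quantities that learn are the coprocessor parameters that generate the extra cache entries.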
Taxonomy
Topics: Video Analysis and Summarization
Methods: Sparse Evolutionary Training
