Deliberation in Latent Space via Differentiable Cache Augmentation

Luyang Liu; Jonas Pfeiffer; Jiaxing Wu; Jun Xie; Arthur Szlam

arXiv:2412.17747·cs.CL·December 24, 2024

TL;DR

This paper introduces differentiable cache augmentation for large language models: an offline coprocessor adds latent embeddings to a frozen model's key-value cache, which reduces perplexity and improves reasoning performance without modifying the decoder.

Contribution

The authors propose an offline, differentiable cache augmentation technique. A coprocessor augments the key-value cache with latent embeddings and is trained with the decoder's language modeling loss while the decoder itself stays frozen.

Findings

01. Cache augmentation reduces perplexity on subsequent tokens.

02. The method improves performance on reasoning-intensive tasks.

03. The approach operates asynchronously without modifying the decoder.

Abstract

Techniques enabling large language models (LLMs) to "think more" by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. However, the standard approaches generate sequences of discrete tokens immediately before responding, and so they can incur significant latency costs and be challenging to optimize. In this work, we demonstrate that a frozen LLM can be augmented with an offline coprocessor that operates on the model's key-value (kv) cache. This coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding. We train this coprocessor using the language modeling loss from the decoder on standard pretraining data, while keeping the decoder itself frozen. This approach enables the model to learn, in an end-to-end differentiable fashion, how to distill additional computation into its…
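
The training setup described in the abstract can be illustrated with a deliberately tiny sketch. Everything below is an assumption made for illustration rather than the authors' architecture: the "decoder" is a single attention read over a list of (key, value) pairs followed by a linear output layer, the hypothetical `coprocessor` mean-pools the cache and projects it into `N_LATENTS` extra (key, value) pairs, and a finite-difference gradient stands in for backpropagation so the example needs only the Python standard library. The point it shows is the division of labor: the loss is the decoder's next-token loss, but only the coprocessor parameters `theta` are ever updated.

```python
import math
import random

DIM, N_LATENTS, VOCAB = 4, 2, 5  # toy sizes, chosen arbitrarily

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def decoder_loss(cache, query, w_out, target):
    """Frozen toy decoder: attend over the cache, then next-token cross-entropy."""
    weights = softmax([dot(query, k) for k, _ in cache])
    ctx = [sum(w * v[i] for w, (_, v) in zip(weights, cache)) for i in range(DIM)]
    probs = softmax([dot(row, ctx) for row in w_out])
    return -math.log(probs[target])

def coprocessor(cache, theta):
    """Hypothetical coprocessor: pool the cache, project to N_LATENTS (key, value) pairs."""
    pooled = [sum(k[i] + v[i] for k, v in cache) / len(cache) for i in range(DIM)]
    rows = [theta[r * DIM:(r + 1) * DIM] for r in range(2 * N_LATENTS * DIM)]
    vecs = [[dot(row, pooled) for row in rows[j * DIM:(j + 1) * DIM]]
            for j in range(2 * N_LATENTS)]
    return [(vecs[2 * j], vecs[2 * j + 1]) for j in range(N_LATENTS)]

def numeric_grad(f, theta, eps=1e-5):
    """Central finite differences; a stand-in for backpropagation in this sketch."""
    grad = []
    for i in range(len(theta)):
        up = theta[:i] + [theta[i] + eps] + theta[i + 1:]
        down = theta[:i] + [theta[i] - eps] + theta[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

rng = random.Random(0)

def rand_vec():
    return [rng.gauss(0, 1) for _ in range(DIM)]

cache = [(rand_vec(), rand_vec()) for _ in range(3)]  # kv cache of a 3-token prefix
query = rand_vec()
w_out = [rand_vec() for _ in range(VOCAB)]            # frozen decoder weights
target = 2                                            # index of the "next token"
theta = [rng.gauss(0, 0.1) for _ in range(2 * N_LATENTS * DIM * DIM)]
frozen_snapshot = [row[:] for row in w_out]

def objective(t):
    # Decoder loss computed on the cache extended with the latent embeddings.
    return decoder_loss(cache + coprocessor(cache, t), query, w_out, target)

loss_before = objective(theta)
lr = 0.5
for _ in range(30):
    grad = numeric_grad(objective, theta)
    candidate = [t - lr * g for t, g in zip(theta, grad)]
    if objective(candidate) < objective(theta):
        theta = candidate  # update coprocessor parameters only
    else:
        lr *= 0.5          # step too large; shrink it and retry
loss_after = objective(theta)

print("loss on plain cache:          ", decoder_loss(cache, query, w_out, target))
print("augmented, before training:   ", loss_before)
print("augmented, after training:    ", loss_after)
```

Because `cache + coprocessor(...)` builds a new list, the original cache and the decoder weights are left untouched, mirroring the frozen-decoder setup. In a real system the gradient would come from automatic differentiation through the decoder rather than from finite differences.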
