Compressed Context Memory For Online Language Model Interaction

Jang-Hyun Kim; Junyoung Yeom; Sangdoo Yun; Hyun Oh Song

arXiv:2312.03414 · cs.LG · February 7, 2024 · 1 citation

TL;DR

This paper introduces a compressed context memory system for Transformer language models that maintains high performance with significantly reduced memory usage, enabling efficient online interactions with unlimited context length.

Contribution

It proposes a novel recursive compression method integrated with conditional LoRA, allowing models to handle unlimited context efficiently without full fine-tuning.
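As a rough illustration of what a "conditional" LoRA could look like, the sketch below applies a frozen weight to every token and adds a low-rank update only at flagged positions. It assumes the condition is "this position is a compression token", which this page does not confirm. The names `conditional_lora_forward`, `is_comp`, `a`, and `b` are hypothetical and are not taken from the authors' code.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(x: Vector, w: Matrix) -> Vector:
    """Row vector times matrix: returns x @ w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def conditional_lora_forward(
    tokens: List[Vector],
    is_comp: List[bool],
    w: Matrix,
    a: Matrix,
    b: Matrix,
) -> List[Vector]:
    """Apply the frozen weight w to every token; add the low-rank
    update (x @ a) @ b only where is_comp is True (assumed condition)."""
    outputs = []
    for x, flag in zip(tokens, is_comp):
        y = matvec(x, w)  # frozen base projection, shared by all tokens
        if flag:
            delta = matvec(matvec(x, a), b)  # rank-r correction
            y = [yi + di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs


# Toy example: identity base weight and a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0], [1.0]]  # shape 2 x 1
b = [[0.5, 0.5]]    # shape 1 x 2
tokens = [[1.0, 2.0], [1.0, 2.0]]
is_comp = [False, True]  # only the second position is treated as a compression token
outputs = conditional_lora_forward(tokens, is_comp, w, a, b)
```

In this toy run the first token passes through the base weight unchanged, while the second, identical token also receives the adapter's correction. Ordinary tokens therefore keep the behavior of the frozen model.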

Findings

01. Achieves full-context model performance with 5x smaller memory.

02. Outperforms the sliding window approach in streaming scenarios.

03. Supports unlimited context length in online language model interactions.

Abstract

This paper presents a context key/value compression method for Transformer language models in online scenarios, where the context continually expands. As the context lengthens, the attention process demands increasing memory and computations, which in turn reduces the throughput of the language model. To address this challenge, we propose a compressed context memory system that continually compresses the accumulating attention key/value pairs into a compact memory space, facilitating language model inference in a limited memory space of computing environments. Our compression process involves integrating a lightweight conditional LoRA into the language model's forward pass during inference, without the need for fine-tuning the model's entire set of weights. We achieve efficient training by modeling the recursive compression process as a single parallelized forward computation. Through…
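The recursive update described in the abstract can be sketched as a loop in which each new context chunk is compressed, conditioned on the current memory, and then written back into that memory. Everything in this sketch is an assumption for illustration. `toy_compress` is a stand-in (a plain average) for the model's forward pass. The two update rules, appending a slot ("concat") and keeping a running average in a single slot ("merge"), are simple choices that are not stated on this page.

```python
from typing import List

Vector = List[float]


def toy_compress(memory: List[Vector], chunk: List[Vector]) -> Vector:
    """Hypothetical compressor: averages memory slots and new chunk vectors.
    A real system would run the language model here instead."""
    rows = memory + chunk
    dim = len(chunk[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]


class CompressedContextMemory:
    """Toy memory that grows by one slot per step ("concat")
    or stays at one slot via a running average ("merge")."""

    def __init__(self, mode: str = "merge") -> None:
        if mode not in ("concat", "merge"):
            raise ValueError("mode must be 'concat' or 'merge'")
        self.mode = mode
        self.slots: List[Vector] = []
        self.steps = 0

    def update(self, chunk: List[Vector]) -> None:
        # Compress the new chunk conditioned on the current memory.
        h = toy_compress(self.slots, chunk)
        self.steps += 1
        if self.mode == "concat" or not self.slots:
            self.slots.append(h)
        else:
            # Running average keeps the memory size constant.
            t = self.steps
            old = self.slots[0]
            self.slots[0] = [((t - 1) * o + n) / t for o, n in zip(old, h)]


# Two context chunks arriving online, each holding one 2-dimensional vector.
stream = [[[2.0, 4.0]], [[4.0, 0.0]]]
merge_mem = CompressedContextMemory("merge")
concat_mem = CompressedContextMemory("concat")
for chunk in stream:
    merge_mem.update(chunk)
    concat_mem.update(chunk)
```

After the loop, `concat_mem` holds one slot per chunk, while `merge_mem` still holds a single slot. That is the trade-off in this sketch: memory that grows slowly with the number of interactions versus memory of constant size.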

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- The problem of efficiently handling expanding contexts is highly relevant given the online nature of systems like ChatGPT. The paper addresses an important open challenge.
- The method is flexible and broadly applicable to diverse online inference scenarios like multi-task learning, personalization and conversation.
- Empirical evaluations across three datasets substantiate the memory and computation advantages over baselines. The method achieves slightly lower performance than the full cont

Weaknesses

- The main limitation of the proposed compression framework is that it is task-specific. The compression module must be trained for each task, which requires additional data and computation, and it cannot generalize to new tasks. This is a significant drawback in the context of foundation models, which are trained on large datasets for general-purpose use.
- There is still an obvious gap in performance between the compressed and full context models. The paper does not provide a clear explanation for thi

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. The paper is sound overall. The method design is concise, effective, and efficient. Compared with retrieval-based methods that re-compute the sentence embedding, CCM can directly adopt the KV cache of the introduced <COMP> token as the memory vector for one utterance and use it in further inference. To get the LLM to use such a CCM, the parallel training and LoRA adapter are well designed for efficient adaptation.
2. The CCM is efficient in both training and inference. Firstly, there

Weaknesses

1. The CCM method is not that novel and was explored well in some important early milestones that predate the Transformer, i.e., Memory Networks and Using Fast Weights to Attend to the Recent Past. The authors should mention and discuss the relation to these methods. Additionally, the Compressive Transformer should be briefly introduced, as it is not a universally known preliminary for readers.
2. In terms of the baselines, in the main tables, CCM is only compared with "no context" and "full context" b

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

- An interesting method to compress contexts in the few-shot learning setting.
- The results on the few-shot learning tasks show its effectiveness and superiority over conventional approaches like RMT and Gist.

Weaknesses

While this paper presents a seemingly promising solution to long contexts, I have significant concerns about several limitations. Firstly, one of the main focuses of this paper is handling dynamic context for interaction. Judging from its experimental design, it mainly conducts experiments with a fine-tuned LLM for few-shot learning scenarios, which are generally simpler tasks, all being multiple-choice or classification tasks. The methods primarily compared in this paper are general context comp

Code & Models

Repositories

snu-mllab/context-memory · PyTorch · Official

Videos

Compressed Context Memory for Online Language Model Interaction · SlidesLive

Taxonomy

Topics: Speech and dialogue systems · Context-Aware Activity Recognition Systems · Robotics and Automated Systems

Methods: Sparse Evolutionary Training · Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Dropout · Absolute Position Encodings · Softmax · Layer Normalization