Compressed Context Memory For Online Language Model Interaction
Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song

TL;DR
This paper introduces a compressed context memory system for Transformer language models that maintains high performance with significantly reduced memory usage, enabling efficient online interactions with unlimited context length.
Contribution
It proposes a novel recursive compression method integrated with conditional LoRA, allowing models to handle unlimited context efficiently without full fine-tuning.
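A minimal sketch of what a conditional LoRA layer could look like, assuming "conditional" means that the low-rank update is applied only at positions holding the compression token (written <COMP> in the reviews below) while every other token passes through the frozen base weight. The helper names (matvec, conditional_lora) and the toy weights are hypothetical and are not taken from the authors' code.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def conditional_lora(x, W, A, B, is_comp):
    """Return W x + m * B (A x), with m = 1 only at <COMP> positions.

    W is the frozen base weight (d_out x d_in).
    A (r x d_in) and B (d_out x r) are the trainable low-rank factors.
    Assumption: the gate m is a hard 0/1 switch on the token type.
    """
    base = matvec(W, x)
    if not is_comp:
        # Ordinary tokens see the unmodified pretrained layer.
        return base
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]


# Toy weights: identity base layer plus a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.5], [-0.5]]
x = [2.0, 4.0]

plain = conditional_lora(x, W, A, B, is_comp=False)
comp = conditional_lora(x, W, A, B, is_comp=True)
print(plain, comp)
```

Under this reading, only A and B are trained, and the model's behavior on regular tokens is unchanged, which is consistent with the claim that full fine-tuning is not needed.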
Findings
Achieves full context model performance with 5x smaller memory.
Outperforms a sliding-window approach in streaming scenarios.
Supports unlimited context length in online language model interactions.
Abstract
This paper presents a context key/value compression method for Transformer language models in online scenarios, where the context continually expands. As the context lengthens, the attention process demands increasing memory and computations, which in turn reduces the throughput of the language model. To address this challenge, we propose a compressed context memory system that continually compresses the accumulating attention key/value pairs into a compact memory space, facilitating language model inference in a limited memory space of computing environments. Our compression process involves integrating a lightweight conditional LoRA into the language model's forward pass during inference, without the need for fine-tuning the model's entire set of weights. We achieve efficient training by modeling the recursive compression process as a single parallelized forward computation. Through…
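The recursive memory update described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: compress is a stand-in that simply averages the new key/value vectors (the paper instead obtains the compressed pairs from the language model's forward pass), and the two update rules, "concat" and "merge", are hypothetical choices used here to show a memory that grows by one slot per step versus one that stays at a fixed size.

```python
def compress(memory, context_kv):
    """Placeholder compressor: average the new KV vectors into one vector.

    Assumption: a real system would condition on `memory` through the
    language model; this stand-in ignores it and is for illustration only.
    """
    n = len(context_kv)
    dim = len(context_kv[0])
    return [sum(v[i] for v in context_kv) / n for i in range(dim)]


def run_online(contexts, mode):
    """Consume contexts one step at a time and return the final memory."""
    memory = []
    for t, context_kv in enumerate(contexts, start=1):
        h = compress(memory, context_kv)
        if mode == "concat":
            # Memory grows by one compressed vector per step.
            memory = memory + [h]
        elif memory:
            # Merge: running average keeps the memory at a single slot.
            prev = memory[0]
            memory = [[p + (x - p) / t for p, x in zip(prev, h)]]
        else:
            memory = [h]
    return memory


# Three time steps with 2, 1, and 3 token-level KV vectors (toy values).
contexts = [
    [[1.0, 1.0], [3.0, 3.0]],
    [[4.0, 4.0]],
    [[6.0, 6.0], [6.0, 6.0], [6.0, 6.0]],
]
raw_size = sum(len(c) for c in contexts)
concat_memory = run_online(contexts, "concat")
merge_memory = run_online(contexts, "merge")
print(raw_size, len(concat_memory), len(merge_memory))
```

In this toy run the uncompressed cache would hold 6 vectors, the concatenating memory holds 3, and the merging memory holds 1. The loop is sequential here for readability; the abstract states that training expresses the same recursion as a single parallelized forward computation.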
Peer Reviews
Decision: ICLR 2024 poster
- The problem of efficiently handling expanding contexts is highly relevant given the online nature of systems like ChatGPT. The paper addresses an important open challenge.
- The method is flexible and broadly applicable to diverse online inference scenarios like multi-task learning, personalization and conversation.
- Empirical evaluations across three datasets substantiate the memory and computation advantages over baselines. The method achieves slightly lower performance than the full cont
- The main limitation of the proposed compression framework is that it is task-specific. The compression module must be trained for each task, which requires additional data and computation, and it cannot generalize to new tasks. This is a significant drawback in the context of foundation models, which are trained on large datasets for general-purpose use.
- There is still an obvious gap in performance between the compressed and full context models. The paper does not provide a clear explanation for thi
1. The paper is overall sound. The method design is concise, effective, and efficient. Compared with retrieval-based methods that re-compute sentence embeddings, CCM can directly adopt the KV cache of the introduced <COMP> token as the memory vector for one utterance and use it in further inference. To get the LLM to use this memory, the parallel training and the LoRA adapter are well designed for efficient adaptation.
2. The CCM is efficient in both training and inference. Firstly, there
1. The CCM method is not that novel; the idea was explored in important early milestones that predate the Transformer, e.g., Memory Networks and Using Fast Weights to Attend to the Recent Past. The authors should mention and discuss the relation to these methods. Additionally, the Compressive Transformer should be briefly introduced, as it is not a universally known preliminary for readers.
2. In terms of the baselines, in the main tables, CCM is only compared with "no context" and "full context" b
- An interesting method to compress contexts in the few-shot learning setting.
- The results on the few-shot learning tasks show the effectiveness and superiority of the method over conventional approaches like RMT and Gist.
While this paper presents a seemingly promising solution to long contexts, I have significant concerns about several limitations. Firstly, one of the main focuses of this paper is handling dynamic context for interaction. Judging from its experimental design, it mainly conducts experiments with a fine-tuned LLM in few-shot learning scenarios, which are generally simpler tasks, all being multiple-choice or classification tasks. The methods primarily compared in this paper are general context comp
Taxonomy
Topics: Speech and dialogue systems · Context-Aware Activity Recognition Systems · Robotics and Automated Systems
Methods: Sparse Evolutionary Training · Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Dropout · Absolute Position Encodings · Softmax · Layer Normalization
