TL;DR
CoMeT introduces a memory-efficient Transformer architecture that handles arbitrarily long sequences with constant memory and linear time, enabling effective long-context modeling in large language models.
Contribution
It proposes a dual-memory system and a layer-level pipeline parallelism strategy, so that pre-trained models can be adapted to long sequences with only minimal fine-tuning.
Findings
Models equipped with CoMeT can retrieve information from 1M-token sequences.
Outperforms other efficient methods on the SCROLLS benchmark.
Achieves comparable performance to full-attention models on summarization.
Abstract
The quadratic complexity and indefinitely growing key-value (KV) cache of standard Transformers pose a major barrier to long-context processing. To overcome this, we introduce the Collaborative Memory Transformer (CoMeT), a novel architecture that enables LLMs to handle arbitrarily long sequences with constant memory usage and linear time complexity. Designed as an efficient, plug-in module, CoMeT can be integrated into pre-trained models with only minimal fine-tuning. It operates on sequential data chunks, using a dual-memory system to manage context: a temporary memory on a FIFO queue for recent events, and a global memory with a gated update rule for long-range dependencies. These memories then act as a dynamic soft prompt for the next chunk. To enable efficient fine-tuning on extremely long contexts, we introduce a novel layer-level pipeline parallelism strategy. The effectiveness…
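The chunk-wise dual-memory idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `DualMemory`, the helper `summarize`, the fixed scalar `gate_logit`, and the convex-combination update are all assumptions made for illustration, and a real model would derive chunk representations and gate values from learned Transformer layers. The sketch only shows why the context passed to each chunk stays bounded: a FIFO queue holds a fixed number of recent chunk summaries, and a single global vector is updated through a gate.

```python
from collections import deque
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class DualMemory:
    """Toy dual memory: a FIFO temporary memory plus a gated global memory."""

    def __init__(self, dim, fifo_slots):
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self.temporary = deque(maxlen=fifo_slots)
        self.global_mem = [0.0] * dim

    def update(self, chunk_summary, gate_logit):
        # Assumed gate: a fixed scalar here; a real model would learn it.
        g = sigmoid(gate_logit)
        self.global_mem = [
            (1.0 - g) * m + g * s
            for m, s in zip(self.global_mem, chunk_summary)
        ]
        self.temporary.append(list(chunk_summary))

    def soft_prompt(self):
        # Global vector first, then the recent chunk summaries.
        return [list(self.global_mem)] + [list(v) for v in self.temporary]


def summarize(chunk, dim):
    # Hypothetical stand-in for the Transformer: the mean token value,
    # repeated dim times.
    mean = sum(chunk) / len(chunk)
    return [mean] * dim


def process(tokens, chunk_size, dim=4, fifo_slots=2, gate_logit=0.0):
    memory = DualMemory(dim, fifo_slots)
    prompt_sizes = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # Context carried over from earlier chunks; its size is bounded.
        prompt = memory.soft_prompt()
        prompt_sizes.append(len(prompt))
        memory.update(summarize(chunk, dim), gate_logit)
    return memory, prompt_sizes


if __name__ == "__main__":
    tokens = [float(i) for i in range(1000)]
    memory, prompt_sizes = process(tokens, chunk_size=100)
    print(prompt_sizes)
```

In this sketch the soft prompt never exceeds `1 + fifo_slots` vectors regardless of sequence length, and the loop visits each chunk once, which mirrors the constant-memory, linear-time behavior the abstract claims.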
