AllMem: A Memory-centric Recipe for Efficient Long-context Modeling
Ziming Wang, Xiang Wang, Kailong Peng, Lang Qin, Juan Gabriel Kostelec, Christos Sourmpis, Axel Laborieux, Qinghai Guo

TL;DR
AllMem introduces a hybrid architecture that combines sliding window attention with memory networks, enabling efficient long-context modeling in LLMs at reduced computational cost while mitigating catastrophic forgetting.
Contribution
The paper presents a novel hybrid architecture and a memory-efficient fine-tuning method that allow pre-trained LLMs to effectively handle ultra-long contexts with minimal performance loss.
Findings
Achieves near-lossless performance on 37k LongBench with a 4k attention window.
Outperforms full attention on InfiniteBench at 128k context.
Reduces computational and memory overhead during long-sequence inference.
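To make the memory claim concrete, here is a rough back-of-the-envelope sketch of how a key/value cache grows under full attention versus a fixed sliding window. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the helper names are hypothetical and are not taken from the paper. The sketch also ignores the memory network's own state, which would add a fixed cost on top of the window.

```python
def kv_cache_bytes(cached_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `cached_tokens` tokens."""
    # Factor of 2: one tensor for keys and one for values, per layer.
    return 2 * cached_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value


def tokens_kept(context_len, window=None):
    """Tokens that stay in the cache: all of them, or at most `window`."""
    return context_len if window is None else min(context_len, window)


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
CONTEXT = 128 * 1024  # 128k-token context
WINDOW = 4 * 1024     # 4k-token sliding window

full = kv_cache_bytes(tokens_kept(CONTEXT), N_LAYERS, N_KV_HEADS, HEAD_DIM)
windowed = kv_cache_bytes(tokens_kept(CONTEXT, WINDOW), N_LAYERS, N_KV_HEADS, HEAD_DIM)

print(f"full attention cache: {full / 2**30:.2f} GiB")
print(f"sliding window cache: {windowed / 2**30:.2f} GiB")
print(f"reduction factor:     {full // windowed}x")
```

Under these assumed dimensions the full-attention cache is 16 GiB at 128k tokens, while the 4k-window cache stays at 0.5 GiB regardless of how long the context grows.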
Abstract
Large Language Models (LLMs) encounter significant performance bottlenecks in long-sequence tasks due to the computational complexity and memory overhead inherent in the self-attention mechanism. To address these challenges, we introduce AllMem, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with non-linear Test-Time Training (TTT) memory networks. AllMem enables models to effectively scale to ultra-long contexts while mitigating catastrophic forgetting. This approach not only overcomes the representation constraints typical of linear memory models but also significantly reduces the computational and memory footprint during long-sequence inference. Furthermore, we implement a Memory-Efficient Fine-Tuning strategy to replace standard attention layers in pre-trained models with memory-augmented sliding window layers. This…
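The two ingredients named in the abstract can be illustrated with a minimal NumPy sketch: causal sliding window attention, and a small non-linear memory whose weights are updated by gradient steps at inference time. This is a generic illustration, not the AllMem implementation. The class `MLPMemory`, its two-layer tanh form, the squared-error objective, and the learning rate are all assumptions, and how the paper actually couples the memory to the attention layers is not covered here.

```python
import numpy as np


def sliding_window_attention(q, k, v, window):
    """Causal attention where token i attends only to tokens i-window+1 .. i."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    visible = (j <= i) & (j > i - window)
    scores = np.where(visible, scores, -np.inf)
    # Numerically stable softmax; every row keeps at least its diagonal entry.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


class MLPMemory:
    """Hypothetical non-linear memory: a two-layer tanh MLP trained at test time."""

    def __init__(self, dim, hidden, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, dim ** -0.5, (dim, hidden))
        self.W2 = rng.normal(0.0, hidden ** -0.5, (hidden, dim))
        self.lr = lr

    def read(self, x):
        """Retrieve values for the given queries or keys."""
        return np.tanh(x @ self.W1) @ self.W2

    def write(self, k, v):
        """One gradient step on 0.5 * mean-over-rows squared error of read(k) vs v."""
        h = np.tanh(k @ self.W1)
        err = (h @ self.W2 - v) / k.shape[0]
        grad_W2 = h.T @ err
        grad_pre = (err @ self.W2.T) * (1.0 - h ** 2)  # backprop through tanh
        grad_W1 = k.T @ grad_pre
        self.W1 -= self.lr * grad_W1
        self.W2 -= self.lr * grad_W2


# Usage: local attention over a short sequence, then store the same keys and
# values in the memory so they remain retrievable after leaving the window.
rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
local_out = sliding_window_attention(q, k, v, window=3)

memory = MLPMemory(dim=4, hidden=8)
for _ in range(20):
    memory.write(k, v)
recalled = memory.read(k)
```

Because the memory is an MLP and not a single linear map, it is not limited to linear key-to-value associations, which is the kind of representation constraint the abstract attributes to linear memory models.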
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
