Language Modeling With Factorization Memory
Lee Xiong, Maksim Tkachenko, Johanes Effendi, Ting Cai

TL;DR
This paper introduces Factorization Memory, an efficient RNN architecture that performs comparably to Transformers on short-context language modeling, generalizes better on long-context tasks, and supports sparse memory updates that preserve the accuracy of its dense counterpart.
Contribution
The paper presents what the authors describe as, to their knowledge, the first RNN architecture that combines sparse memory activation with competitive performance across both short and long contexts, improving efficiency and generalization.
Findings
Achieves comparable performance to Transformers on short-context tasks.
Demonstrates superior generalization in long-context scenarios.
Introduces a novel sparse formulation of RNN memory.
Abstract
We propose Factorization Memory, an efficient recurrent neural network (RNN) architecture that achieves performance comparable to Transformer models on short-context language modeling tasks while also demonstrating superior generalization in long-context scenarios. Our model builds upon Mamba-2, enabling Factorization Memory to exploit parallel computations during training while preserving constant computational and memory complexity during inference. To further optimize model efficiency and representational capacity, we develop a sparse formulation of Factorization Memory that updates only a subset of recurrent states at each step while preserving the strong performance of its dense counterpart. To our knowledge, this represents the first RNN architecture that successfully combines sparse memory activation with competitive performance across both short and long-context settings. This…
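The sparse formulation mentioned in the abstract (updating only a subset of recurrent states at each step) can be illustrated with a toy sketch. This is not the paper's update rule, which is not given in this summary. The linear router, the softmax scoring, the fixed `decay` constant, and the additive write are all assumptions made for illustration, and the names `sparse_memory_step`, `router`, and `decay` are hypothetical. Mamba-2 itself uses input-dependent gating rather than a fixed decay.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_memory_step(memory, x, router, k, decay=0.9):
    """Update only the top-k rows of a 2D memory in place.

    memory: list of m rows, each a list of d floats (the recurrent state)
    x:      input vector of length d
    router: list of m weight vectors of length d (assumed linear router)
    k:      number of rows to update at this step
    decay:  fixed forgetting factor (an assumption, not from the paper)

    Returns the indices of the rows that were updated.
    """
    # Score every memory row against the input, then normalize.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    probs = softmax(scores)
    # Pick the k rows with the highest routing probability.
    top = sorted(range(len(memory)), key=lambda i: probs[i], reverse=True)[:k]
    # Only the selected rows are decayed and written; the rest stay untouched.
    for i in top:
        memory[i] = [decay * mi + probs[i] * xi for mi, xi in zip(memory[i], x)]
    return top


# Usage: the state stays at m x d regardless of how many tokens are processed.
random.seed(0)
m, d, k = 8, 4, 2
memory = [[0.0] * d for _ in range(m)]
router = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
for _ in range(100):
    token = [random.gauss(0, 1) for _ in range(d)]
    touched = sparse_memory_step(memory, token, router, k)
```

In this sketch each step writes k rows instead of all m, while the state size stays fixed at m × d however long the sequence is. That is the constant inference memory the abstract refers to, shown here only in toy form.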
Peer Reviews
Decision: Submitted to ICLR 2026
1. Factorization Memory offers a genuinely novel approach to address the long-standing challenge of long-context understanding in RNNs. By introducing a 2D memory state with sparse updates, it attempts to reconcile the trade-off between memory capacity and computational efficiency.
2. The concept of sparse memory activation, where only a subset of recurrent states is updated at each step, is a significant strength. This directly tackles the computational and memory bottlenecks of dense recurren
1. "MoM: Linear Sequence Modeling with Mixture-of-Memories" is a good work that also focuses on sparsely expanding RNN memory states, but this paper does not compare against it. I recommend that the authors compare the two approaches in terms of both algorithmic differences and experimental performance.
2. The primary limitation stated by the authors themselves is the constraint of computational resources, leading to investigations on "relatively small-scale models with a low FLOPS budget" (60-70 million parameters for memory scaling, 1B pa
The motivation and design for extending model sparsity to the memory mechanism of linear model architectures are reasonable.
1. The authors claim that "To our knowledge, this represents the **first** RNN architecture that successfully combines sparse memory activation." However, similar sparse memory activation ideas have been explored in other works, such as MoM [1] and SEE [2]. Further comparison and discussion with these works are needed.
2. The method builds on Mamba-2 but lacks more advanced RNNs as baselines, such as GLA [3], DeltaNet [4], and Gated DeltaNet [5]. Can the proposed method be used in other RNN model
1. This paper aims to solve the problem that linear models often struggle to compress long sequences into a fixed-size memory. Their performance on long-context tasks, such as NIAH and recall-intensive tasks, typically falls behind Transformers. This is a well-known and valuable problem.
2. FM shows better performance on long-context tasks, verifying the benefits of sparse memory activation and increased memory capacity.
3. FM can be more efficient than dense activation of memories.
1. The key idea of this paper overlaps heavily with MoM [1]. Both methods adopt sparse activation of memories, with a router that directs each input token to the top-k memories. The theoretical framework was introduced in MoM, and FM seems like a special case that adopts Mamba-2 as the memory update rule.
2. The experiments use only Mamba-2. Although FM could be applied to other linear models to verify its effect, the paper lacks such validation on other architectures (e.g., GLA,
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
