Language Modeling With Factorization Memory

Lee Xiong; Maksim Tkachenko; Johanes Effendi; Ting Cai

arXiv:2511.00315·cs.CL·November 4, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Factorization Memory, an efficient RNN architecture built on Mamba-2 that performs comparably to Transformers on short-context language modeling and generalizes better to long contexts. A sparse variant updates only a subset of recurrent states at each step while retaining the performance of the dense model.

Contribution

The authors present what is, to their knowledge, the first RNN architecture to combine sparse memory activation with competitive performance in both short- and long-context settings.

Findings

1. Achieves performance comparable to Transformers on short-context tasks.

2. Demonstrates superior generalization in long-context scenarios.

3. Introduces a sparse formulation of RNN memory that updates only a subset of recurrent states at each step.

Abstract

We propose Factorization Memory, an efficient recurrent neural network (RNN) architecture that achieves performance comparable to Transformer models on short-context language modeling tasks while also demonstrating superior generalization in long-context scenarios. Our model builds upon Mamba-2, enabling Factorization Memory to exploit parallel computations during training while preserving constant computational and memory complexity during inference. To further optimize model efficiency and representational capacity, we develop a sparse formulation of Factorization Memory that updates only a subset of recurrent states at each step while preserving the strong performance of its dense counterpart. To our knowledge, this represents the first RNN architecture that successfully combines sparse memory activation with competitive performance across both short and long-context settings. This…
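The sparse update described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the paper's formulation: the linear router, the top-k selection, the softmax weighting, the scalar `decay`, and every name below (`sparse_memory_step`, `router`, `memory`) are assumptions made for illustration. The sketch only shows the general idea that each step touches a subset of recurrent states while the total state size stays fixed.

```python
import math
import random


def top_k_indices(scores, k):
    """Return the indices of the k largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def softmax(values):
    """Numerically stable softmax over a short list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_memory_step(memory, router, x, k, decay=0.9):
    """Update only the k memory rows that the router selects for input x.

    memory: list of m rows, each a list of d floats (the recurrent state)
    router: list of m rows, each a list of d floats (hypothetical routing weights)
    Returns the indices of the rows that were updated.
    """
    # One routing score per memory row (assumed: a plain dot product).
    scores = [sum(w * v for w, v in zip(row, x)) for row in router]
    chosen = top_k_indices(scores, k)
    weights = softmax([scores[i] for i in chosen])
    # Only the chosen rows are decayed and written; all others are left as-is.
    for i, weight in zip(chosen, weights):
        memory[i] = [decay * m + weight * v for m, v in zip(memory[i], x)]
    return chosen


random.seed(0)
m, d, k = 8, 4, 2  # illustrative sizes, not taken from the paper
memory = [[0.0] * d for _ in range(m)]
router = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

for _ in range(16):
    token = [random.gauss(0.0, 1.0) for _ in range(d)]
    touched = sparse_memory_step(memory, router, token, k)
    assert len(touched) == k

# The state holds m * d numbers no matter how many tokens were processed.
state_size = sum(len(row) for row in memory)
```

In this toy version each step writes `k` of the `m` rows, so the per-step update cost scales with `k * d` rather than `m * d`, although the routing scores are still computed for all `m` rows.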

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 5

Strengths

1. Factorization Memory offers a genuinely novel approach to the long-standing challenge of long-context understanding in RNNs. By introducing a 2D memory state with sparse updates, it attempts to reconcile the trade-off between memory capacity and computational efficiency.

2. The concept of sparse memory activation, where only a subset of recurrent states is updated at each step, is a significant strength. This directly tackles the computational and memory bottlenecks of dense recurren

Weaknesses

1. "MoM: Linear Sequence Modeling with Mixture-of-Memories" is a good work that also focuses on sparsely expanding RNN memory states, but this paper does not compare against it. I recommend that the authors compare both the algorithmic differences and the experimental performance.

2. The primary limitation stated by the authors themselves is the constraint of computational resources, leading to investigations on "relatively small-scale models with a low FLOPS budget" (60-70 million parameters for memory scaling, 1B pa

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The motivation and design for extending model sparsity to the memory mechanism of linear model architectures are reasonable.

Weaknesses

1. The authors claim that "To our knowledge, this represents the **first** RNN architecture that successfully combines sparse memory activation." However, similar sparse memory activation ideas have been explored in other works, such as MoM [1] and SEE [2]. Further comparison and discussion with these works are needed.

2. The method builds on Mamba-2, but lacks other more advanced RNNs as baselines, such as GLA [3], DeltaNet [4], and Gated DeltaNet [5]. Can the proposed method be used in other RNN model

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. This paper aims to solve the problem that linear models often struggle to compress long sequences into a fixed-size memory. Their performance on long-context tasks, such as NIAH and recall-intensive tasks, typically falls behind Transformers. This is a well-known and valuable problem.

2. FM shows better performance on long-context tasks, verifying the benefits of sparse memory activation and increased memory capacity.

3. FM can be more efficient than dense activation of memories.

Weaknesses

1. The key idea of this paper overlaps heavily with MoM [1]. Both methods adopt sparse activation of memories with a router that directs each input token to its top-k memories. The theoretical framework was introduced in MoM, and FM seems like a special case that adopts Mamba-2 as the memory update rule.

2. The experiments use only Mamba-2. FM could be applied to other linear models to verify its effect, but this paper lacks such validation on other architectures (e.g., GLA,

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications