MoM: Linear Sequence Modeling with Mixture-of-Memories
Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, Yu Cheng

TL;DR
MoM introduces a multi-memory architecture for linear sequence models, significantly improving recall performance while maintaining linear training complexity and constant inference complexity, bridging the gap with Transformer models.
Contribution
MoM proposes a novel multi-memory framework with a routing mechanism, enhancing memory capacity and recall ability in linear sequence models, a significant advancement over existing single-memory approaches.
Findings
MoM outperforms existing linear models on recall-intensive language tasks.
MoM achieves performance comparable to Transformer models.
MoM maintains linear training and constant inference complexity.
Abstract
Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive tasks. To address this limitation, we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. MoM serves as a general framework that can be seamlessly combined with diverse memory update mechanisms across linear models. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing…
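The routing idea in the abstract can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `mom_step`, the weight matrices, the plain additive update `M += outer(k, v)` (the simplest linear-attention rule, whereas the released checkpoints use Gated DeltaNet), the choice to apply the softmax after Top-K selection, and the weighted mixing of the active memories before the query is applied are all assumptions made for the example. The always-on shared memory follows the "Shared Memory" component mentioned in the reviews.

```python
import numpy as np

def mom_step(memories, shared, x, W_router, W_q, W_k, W_v, top_k=2):
    """Process one token x of shape (d,) and update the memory states in place.

    memories: (n_mem, d_head, d_head) array of independent memory states.
    shared:   (d_head, d_head) memory that every token writes to (assumed).
    """
    # Router: score every memory, keep the top_k, renormalise with a softmax.
    # (Assumption: softmax is taken over the selected logits only.)
    logits = x @ W_router                          # (n_mem,)
    active = np.argsort(logits)[-top_k:]           # indices of routed memories
    weights = np.exp(logits[active] - logits[active].max())
    weights /= weights.sum()

    q, k, v = x @ W_q, x @ W_k, x @ W_v            # each (d_head,)
    update = np.outer(k, v)                        # plain linear-attention write

    # Only the routed memories are written, so unrelated tokens do not
    # overwrite each other's state; the shared memory sees every token.
    for i in active:
        memories[i] += update
    shared += update

    # Read: mix the active memories with the router weights, then query.
    mixed = shared + sum(w * memories[i] for w, i in zip(weights, active))
    return q @ mixed, active

# Hypothetical sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
d, d_head, n_mem = 8, 4, 4
W_router = rng.normal(size=(d, n_mem))
W_q, W_k, W_v = (rng.normal(size=(d, d_head)) for _ in range(3))
memories = np.zeros((n_mem, d_head, d_head))
shared = np.zeros((d_head, d_head))

for x in rng.normal(size=(16, d)):                 # a 16-token sequence
    out, active = mom_step(memories, shared, x, W_router, W_q, W_k, W_v)
```

Each token costs the same amount of work and the state stays at `n_mem + 1` fixed-size matrices regardless of sequence length, which is where the linear training cost and constant per-token inference cost come from in this sketch.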
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper pinpoints the weakness of compressing an entire sequence into a single memory state in linear models and connects this limitation to memory interference.
2. The empirical study is extensive, covering recall-intensive benchmarks, long-context tasks, memory behavior analysis, scaling and ablation studies.
3. The authors provide qualitative insights into memory specialization and routing distributions, enhancing the interpretability of MoM.
4. Implementation details show careful con
1. The router formulation specifies a scheme combining softmax and Top-K but lacks crucial implementation details. It does not clarify how ties in Top-K selection are resolved, whether gradients are propagated through the Top-K operation or treated as non-differentiable, or how the learned matrix affects the sparsity of routing and the memory load balance. It also leaves open whether the router is robust to input distribution shifts or token imbalance.
2. Table 5 provides qualitative insights i
- The paper targets a critical and widely recognized weakness of linear-time sequence models: their poor performance on recall-intensive tasks due to the bottleneck of a single fixed-size memory state.
- The experimental results are comprehensive and robust.
- The core idea of using a top-k router to manage multiple, independent RNN memory states is a novel and clever application of sparse activation principles.
- The ablation study in Table 6 indicates that the "Shared Memory" component is critical. Removing it causes a large performance drop (e.g., 2.1 points on Recall tasks). This makes it difficult to disentangle the gains from the "Mixture" mechanism (routing to $k$ of $N$ memories) from the gains of simply having a parallel, always-on "Shared Memory" state. The paper's narrative focuses heavily on the top-k mixture, but a large portion of the gains might be coming from this simpler shared componen
1. Well-motivated problem formulation: The paper clearly identifies memory interference and limited capacity as core issues in linear sequence models, providing both intuitive explanations and neuroscience-inspired motivation from hippocampal theta-gamma oscillations.
2. General and flexible framework: MoM's compatibility with various memory update mechanisms (Table 1 lists 11 different methods) demonstrates its generality. This is a significant practical advantage, allowing easy integration wit
1. While related work in linear sequence modeling and MoE is mentioned, the paper lacks in-depth comparisons and differential analyses with recent methods (such as RWKV-7 and Titans). In particular, Table 1 lists these methods, but the paper does not systematically compare MoM's performance under different memory update mechanisms in experiments.
2. Table 5 claims to have discovered "specialization" in different memories, but it is based solely on qualitative observations of an intermediate
Code & Models
- 🤗 linear-moe-hub/MoM-Gated-Deltanet-340M · model · 9 dl · ♡ 2
- 🤗 linear-moe-hub/MoM-Gated-Deltanet-1.3B · model · 5 dl · ♡ 3
- 🤗 linear-moe-hub/Gated-Deltanet-1.3B · model · 65 dl · ♡ 5
- 🤗 linear-moe-hub/Gated-Deltanet-340M · model · 323 dl · ♡ 1
- 🤗 linear-moe-hub/GLA-340M · model · 7 dl
- 🤗 linear-moe-hub/GSA-340M · model · 2 dl
- 🤗 linear-moe-hub/HGRN2-340M · model
- 🤗 linear-moe-hub/RetNet-340M · model · 3 dl
- 🤗 linear-moe-hub/Transformer-340M · model · 3 dl
Taxonomy
Topics: Gene expression and cancer classification · Neural Networks and Applications · Bayesian Methods and Mixture Models
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax
