UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning

Zihao Huang; Yu Bao; Qiyang Min; Siyan Chen; Ran Guo; Hongzhi Huang; Defa Zhu; Yutao Zeng; Banggu Wu; Xun Zhou; Siyuan Qiao

arXiv:2508.18756·cs.LG·August 27, 2025

TL;DR

UltraMemV2 introduces a redesigned memory-layer architecture that achieves performance comparable to large MoE models with significantly lower memory access, especially excelling in long-context learning tasks.

Contribution

The paper presents UltraMemV2, a novel memory-layer design that closes the performance gap with MoE models while reducing memory access during inference.

Findings

01

Achieves parity with 8-expert MoE models in performance.

02

Improves long-context memorization by +1.6 points.

03

Enhances in-context learning by +7.9 points.

Abstract

While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves…
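
To make the "very few memory accesses" idea concrete, here is a minimal sketch of a generic sparse key-value memory read. It is not UltraMemV2's actual layer: the function name `memory_lookup`, the dot-product scoring, and the softmax over the selected slots are all illustrative assumptions, and the five improvements listed in the abstract are not modeled. The point is only that, once the top-k slots are chosen, just k value rows are read regardless of how large the value table is.

```python
import math

def memory_lookup(query, keys, values, top_k):
    """Generic sparse memory read (hypothetical helper, not the paper's layer)."""
    # Score every key against the query with a dot product.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Keep the indices of the top_k highest-scoring slots.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over the selected scores only (shifted by the max for stability).
    m = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - m) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum over just the selected value rows; other rows are never read.
    dim = len(values[0])
    out = [0.0] * dim
    for w, i in zip(weights, top):
        for d in range(dim):
            out[d] += w * values[i][d]
    return out, top

# Toy usage: 4 memory slots, 2-dim keys, 3-dim values, read only 2 slots.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.0], [-1.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
out, top = memory_lookup([1.0, 0.0], keys, values, top_k=2)
```

For simplicity this sketch still scores every key densely; product-key schemes such as PKM are designed to avoid that full scan, which is a detail left out here.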

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

The paper clearly describes its position relative to MoE, PKM/UltraMem, and PEER. Experiments show that increasing the number of UltraMemV2 layers improves downstream accuracy even when validation loss plateaus. The proprietary long‑context suite shows non‑trivial gains of 6.2 on multi‑round memorizing and 7.9 on in‑context learning. The paper is explicit that UltraMemV2 underperforms early in training and benefits from continued training, and also notes dependence on per‑block placement.

Weaknesses

The paper asserts matching compute and parameters, but does not report KV‑cache costs, routing FLOPs, or memory traffic for both MoE and UltraMemV2. The claim that this work is the first memory layer to match 8‑expert MoE is not accurate in light of the Memory Layers at Scale paper [https://arxiv.org/abs/2412.09764].

Reviewer 02·Rating 8·Confidence 3

Strengths

Strong architectural contributions: the five design changes proposed by the authors are all justified through ablations and contribute to improved model performance. Strong empirical evaluation: multiple model scales up to 120B and a diverse selection of benchmarks make the authors' claims very convincing. Initialization analysis: the paper contributes a new initialization scheme to stabilize training of the memory layer, which addresses a common failure mode for large sparse modules. Practical

Weaknesses

The paper motivates UltraMemV2 with lower memory access and inference cost, but it would be more convincing to see latency and bandwidth comparisons against traditional MoE models. Proprietary data: this might be unavoidable, but the proprietary nature of the benchmarks and data limits the reproducibility of the methods in this paper. The UltraMemV2 model has significantly worse benchmark performance on multi-hop reasoning. The paper would be improved if the authors investigated this further and demo

Reviewer 03·Rating 6·Confidence 3

Strengths

1. Good ablation on the number of layers; overcomes the limitation on the number of memory layers.
2. Matches the performance of MoEs with 8 experts.
3. Uses strong benchmarks and evaluation.
4. Simplifies the value expansion, making inference more efficient.

Weaknesses

Overall the contribution is light: the paper aims to bridge the performance gap between MoEs and memory-layer architectures. In terms of scientific novelty, some of the approaches seem incremental, and the method appears to combine multiple incremental tweaks to achieve performance improvements over the baseline. For example, the “Memory Layers at Scale” paper (Berges, 2024) demonstrated that multiple memory layers increase performance significantly over having a single layer (In their case, perfo
