MIDUS: Memory-Infused Depth Up-Scaling

Taero Kim; Hoyoon Byun; Youngjun Choi; Sungrae Park; Kyungwoo Song

arXiv:2512.13751 · cs.LG · May 12, 2026

TL;DR

MIDUS expands a pre-trained language model by replacing the FFN branches of duplicated Transformer blocks with memory layers, so that added depth contributes lightweight retrieval-based residual capacity rather than more dense FFN parameters.

Contribution

The paper proposes Memory-Infused Depth Up-Scaling (MIDUS), which replaces the duplicated FFN branches of Depth Up-Scaling with memory layers, and introduces a Head-wise Memory Layer (HML) built on multi-head product-key memory with head-wise retrieval.
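The abstract names multi-head product-key memory as the building block of the memory layers. The sketch below shows a generic single-head product-key lookup in plain Python, to make "retrieval-based" concrete. It is not the authors' HML: the function names, the two-sub-key layout, and the toy sizes are all assumptions chosen for illustration.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def top_k(scores, k):
    """Return (index, score) pairs for the k highest scores."""
    return sorted(enumerate(scores), key=lambda p: p[1], reverse=True)[:k]

def product_key_lookup(query, sub_keys_1, sub_keys_2, values, k):
    """Generic product-key retrieval (hypothetical helper, not the paper's code).

    The query is split in two halves. Each half is scored against its own
    small sub-key set, and the Cartesian product of the two top-k lists
    addresses len(sub_keys_1) * len(sub_keys_2) value slots.
    """
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    best_1 = top_k([dot(q1, key) for key in sub_keys_1], k)
    best_2 = top_k([dot(q2, key) for key in sub_keys_2], k)

    # Candidate (i, j) maps to slot i * len(sub_keys_2) + j, scored s_i + s_j.
    candidates = [
        (i * len(sub_keys_2) + j, s_i + s_j)
        for i, s_i in best_1
        for j, s_j in best_2
    ]
    selected = sorted(candidates, key=lambda p: p[1], reverse=True)[:k]

    # Softmax over the selected scores, then a weighted sum of value rows.
    max_score = max(score for _, score in selected)
    weights = [math.exp(score - max_score) for _, score in selected]
    total = sum(weights)
    output = [0.0] * len(values[0])
    for (slot, _), weight in zip(selected, weights):
        for d in range(len(output)):
            output[d] += (weight / total) * values[slot][d]
    return output, [slot for slot, _ in selected]

# Toy example: 2 x 2 sub-keys address 4 value slots.
sub_keys_1 = [[1.0, 0.0], [0.0, 1.0]]
sub_keys_2 = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
output, slots = product_key_lookup([10.0, 0.0, 0.0, 10.0],
                                   sub_keys_1, sub_keys_2, values, k=1)
print(slots, output)
```

In a memory layer, the retrieved vector would be added to the hidden state as a residual. Only the selected value rows are touched per query, which is why such a branch can be lighter than a dense FFN.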

Findings

01. Empirical improvements in model performance and efficiency.

02. Structural analysis shows the Head-wise Memory Layer (HML) with HIVE acting as a head-conditioned residual alternative to the duplicated dense FFN branch.

03. Demonstrates effective capacity expansion without increasing dense FFN parameters.

Abstract

Expanding pre-trained language models offers a practical way to increase capacity without training larger models from scratch. Depth Up-Scaling (DUS) does so by duplicating Transformer blocks and inserting them into a pre-trained backbone. This process also duplicates FFN-heavy blocks, increasing parameter and compute cost while adding capacity through a block-level dense residual branch. Yet prior work suggests that added capacity need not remain tied to dense FFN branches, while attention heads often play heterogeneous roles, motivating more efficient head-level residual corrections. We propose Memory-Infused Depth Up-Scaling (MIDUS), which replaces the duplicated FFN branches with memory layers and turns added depth into lightweight retrieval-based residual capacity. We introduce a Head-wise Memory Layer (HML), which combines multi-head product-key memory with Head-wise Implicit…
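As a rough illustration of the block duplication described in the abstract, the following sketch treats a backbone as a list of toy blocks and inserts deep copies of a chosen span. Which blocks are duplicated and where they are inserted are assumptions made here, since the abstract does not specify them. The `branch` field and the `duplicate_branch` parameter are hypothetical stand-ins for swapping the duplicated FFN branch for a memory layer.

```python
import copy

def depth_up_scale(blocks, start, end, insert_at, duplicate_branch=None):
    """Insert deep copies of blocks[start:end] at position insert_at.

    If duplicate_branch is given, each copy's 'branch' field is replaced by
    it (a placeholder for swapping the dense FFN branch for a memory layer).
    The span and insertion point are illustrative choices, not the paper's.
    """
    duplicated = [copy.deepcopy(block) for block in blocks[start:end]]
    if duplicate_branch is not None:
        for block in duplicated:
            block["branch"] = duplicate_branch
    return blocks[:insert_at] + duplicated + blocks[insert_at:]

# Toy backbone: four blocks, each with an attention part and a dense FFN branch.
backbone = [{"name": f"block{i}", "attn": "attention", "branch": "ffn"}
            for i in range(4)]

# Plain DUS: duplicated blocks keep their dense FFN branch.
dus = depth_up_scale(backbone, start=1, end=3, insert_at=3)

# MIDUS-style variant: duplicated blocks carry a memory branch instead.
midus_like = depth_up_scale(backbone, start=1, end=3, insert_at=3,
                            duplicate_branch="memory")

print([b["name"] for b in dus])
print([b["branch"] for b in midus_like])
```

Because the copies are deep copies, the original blocks are left unchanged, which mirrors the idea that the pre-trained backbone is preserved while the inserted depth is what gets modified.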
