MIDUS: Memory-Infused Depth Up-Scaling
Taero Kim, Hoyoon Byun, Youngjun Choi, Sungrae Park, Kyungwoo Song

TL;DR
MIDUS expands pre-trained language models by replacing the dense feedforward network (FFN) branches that Depth Up-Scaling duplicates with memory layers, so the added depth contributes lightweight retrieval-based residual capacity.
Contribution
It proposes Memory-Infused Depth Up-Scaling (MIDUS), which replaces the FFN branches duplicated by Depth Up-Scaling with memory layers and employs head-wise memory and retrieval mechanisms.
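The abstract names multi-head product-key memory as one component of the Head-wise Memory Layer. The sketch below illustrates a generic single-head product-key lookup, not the MIDUS layer itself: the function names, the toy key grid, and the slot layout `i * n + j` are assumptions made for illustration. A query is split into two halves, each half is scored against its own small set of sub-keys, and the best slots of the implicit n x n key grid are found from only k x k candidates.

```python
import math
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_k(scores, k):
    """Return (index, score) pairs for the k largest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


def product_key_lookup(query, sub_keys_1, sub_keys_2, values, k):
    """Generic product-key memory read for one query (hypothetical helper).

    Slot (i, j) of the implicit key grid has key
    concat(sub_keys_1[i], sub_keys_2[j]) and value values[i * n + j],
    where n = len(sub_keys_2).
    """
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    n = len(sub_keys_2)

    # Score each query half against its own sub-key set.
    best_1 = top_k([dot(q1, key) for key in sub_keys_1], k)
    best_2 = top_k([dot(q2, key) for key in sub_keys_2], k)

    # A full-key score is the sum of the two half scores, so the overall
    # top-k slots must lie among these k * k candidates.
    candidates = [
        (i * n + j, s1 + s2) for (i, s1), (j, s2) in product(best_1, best_2)
    ]
    candidates.sort(key=lambda c: c[1], reverse=True)
    selected = candidates[:k]

    # Softmax over the selected scores, then a weighted sum of their values.
    max_score = max(score for _, score in selected)
    weights = [math.exp(score - max_score) for _, score in selected]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for (slot, _), w in zip(selected, weights):
        for d, v in enumerate(values[slot]):
            out[d] += (w / total) * v
    return out, [slot for slot, _ in selected]


# Toy memory: 2 x 2 = 4 slots, each storing a one-dimensional value.
sub_keys_1 = [[1.0, 0.0], [0.0, 1.0]]
sub_keys_2 = [[1.0, 0.0], [0.0, 1.0]]
values = [[float(slot)] for slot in range(4)]
output, slots = product_key_lookup(
    [1.0, 0.0, 0.0, 1.0], sub_keys_1, sub_keys_2, values, k=1
)
```

The appeal of this structure is that n * n slots are addressed while only 2n sub-keys are scored per query. A multi-head variant would run one such lookup per head, each with its own query.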
Findings
Reports empirical improvements in model performance and efficiency.
Structural analysis characterizes HML with HIVE as a head-conditioned alternative to the dense residual branch.
Demonstrates effective capacity expansion without increasing dense FFN parameters.
Abstract
Expanding pre-trained language models offers a practical way to increase capacity without training larger models from scratch. Depth Up-Scaling (DUS) does so by duplicating Transformer blocks and inserting them into a pre-trained backbone. This process also duplicates FFN-heavy blocks, increasing parameter and compute cost while adding capacity through a block-level dense residual branch. Yet prior work suggests that added capacity need not remain tied to dense FFN branches, while attention heads often play heterogeneous roles, motivating more efficient head-level residual corrections. We propose Memory-Infused Depth Up-Scaling (MIDUS), which replaces the duplicated FFN branches with memory layers and turns added depth into lightweight retrieval-based residual capacity. We introduce a Head-wise Memory Layer (HML), which combines multi-head product-key memory with Head-wise Implicit…
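As a rough illustration of the expansion step described in the abstract, the sketch below duplicates a span of blocks and inserts the copies into the backbone, optionally swapping the copies' dense FFN branch for a memory layer. The `Block` class, the `depth_up_scale` function, the choice of span, and the insertion point directly after the originals are all assumptions for illustration; the source does not specify which blocks are duplicated or where they are placed.

```python
import copy
from dataclasses import dataclass


@dataclass
class Block:
    """Stand-in for one Transformer block (hypothetical, labels only)."""
    name: str
    attention: str
    residual_branch: str  # "ffn" = dense feedforward, "memory" = memory layer


def depth_up_scale(blocks, start, end, replace_ffn_with_memory=False):
    """Duplicate blocks[start:end] and insert the copies after the originals.

    With replace_ffn_with_memory=True, only the duplicated blocks have their
    dense FFN branch swapped for a memory layer; the original backbone
    blocks are left unchanged.
    """
    copies = copy.deepcopy(blocks[start:end])
    for block in copies:
        block.name += "_dup"
        if replace_ffn_with_memory:
            block.residual_branch = "memory"
    return blocks[:end] + copies + blocks[end:]


backbone = [Block(f"block{i}", "self_attn", "ffn") for i in range(4)]

# Plain depth up-scaling: every added block carries another dense FFN.
dus = depth_up_scale(backbone, 1, 3)

# Memory-infused variant: added depth, but no additional dense FFN branches.
midus_like = depth_up_scale(backbone, 1, 3, replace_ffn_with_memory=True)

dus_ffn_count = sum(b.residual_branch == "ffn" for b in dus)
midus_ffn_count = sum(b.residual_branch == "ffn" for b in midus_like)
```

Both expanded stacks have six blocks, but only the plain variant increases the number of dense FFN branches beyond the four in the original backbone.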
