Gated Differentiable Working Memory for Long-Context Language Modeling

Lingrui Mei; Shenghua Liu; Yiwei Wang; Yuyao Ge; Baolong Bi; Jiayu Yao; Jun Wan; Ziling Yin; Jiafeng Guo; Xueqi Cheng

arXiv:2601.12906·cs.CL·January 21, 2026

TL;DR

This paper introduces Gdwm (Gated Differentiable Working Memory), a memory consolidation framework for long-context language modeling. It selectively updates working memory based on contextual utility, reducing test-time computation while maintaining accuracy.

Contribution

Gdwm is the first to incorporate a gated, utility-based memory consolidation mechanism for test-time adaptation in long-context language models.

Findings

01. Gdwm achieves comparable or better performance with 4x fewer gradient steps.

02. It establishes a new efficiency-performance Pareto frontier.

03. Experiments on ZeroSCROLLS and LongBench v2 validate its effectiveness.

Abstract

Long contexts challenge transformers: attention scores dilute across thousands of tokens, critical information is often lost in the middle, and models struggle to adapt to novel patterns at inference time. Recent work on test-time adaptation addresses this by maintaining a form of working memory -- transient parameters updated on the current context -- but existing approaches rely on uniform write policies that waste computation on low-utility regions and suffer from high gradient variance across semantically heterogeneous contexts. In this work, we reframe test-time adaptation as a budget-constrained memory consolidation problem, focusing on which parts of the context should be consolidated into working memory under limited computation. We propose Gdwm (Gated Differentiable Working Memory), a framework that introduces a write controller to gate the consolidation process. The controller…
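
The contrast between a uniform write policy and a gated, budget-constrained one can be illustrated with a toy scheduler. This is only a sketch of the general framing, not the paper's write controller, whose details are not given above. The function names, the per-chunk utility scores, the threshold gate, and the proportional allocation rule are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def uniform_write_schedule(num_chunks: int, budget: int) -> List[int]:
    """Spread `budget` gradient steps as evenly as possible over all chunks."""
    steps = [budget // num_chunks] * num_chunks
    for i in range(budget % num_chunks):
        steps[i] += 1
    return steps


def gated_write_schedule(
    utilities: Sequence[float], budget: int, threshold: float
) -> List[int]:
    """Allocate steps only to chunks whose (hypothetical) utility passes a gate.

    Chunks below `threshold` receive no steps. The remaining chunks share the
    budget in proportion to their utility, using largest-remainder rounding so
    the allocation sums exactly to `budget`.
    """
    gated = [u if u >= threshold else 0.0 for u in utilities]
    total = sum(gated)
    if total == 0.0:
        # Nothing passes the gate, so nothing is written to memory.
        return [0] * len(utilities)

    raw = [budget * g / total for g in gated]
    steps = [int(r) for r in raw]
    leftover = budget - sum(steps)

    # Hand out leftover steps to gated-in chunks with the largest remainders.
    open_idx = [i for i, g in enumerate(gated) if g > 0.0]
    open_idx.sort(key=lambda i: raw[i] - steps[i], reverse=True)
    for i in open_idx[:leftover]:
        steps[i] += 1
    return steps


if __name__ == "__main__":
    # Assumed utility scores for four context chunks (made-up values).
    utilities = [0.1, 0.9, 0.05, 0.6]
    print(uniform_write_schedule(len(utilities), budget=8))
    print(gated_write_schedule(utilities, budget=8, threshold=0.5))
```

Under the same budget, the uniform schedule spends steps on every chunk regardless of its score, while the gated schedule concentrates them on the chunks that pass the threshold. How utility is actually estimated, and how the gate is applied, is left to the paper.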

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Domain Adaptation and Few-Shot Learning