EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context

Arth Singh

arXiv:2604.08556 · cs.CL · April 13, 2026

TL;DR

This paper uses exponential moving average (EMA) traces, the simplest form of recurrent context, as a probe of what fixed-coefficient accumulation can represent. The traces encode temporal structure well but fail to preserve token identity, which points to the need for learned, input-dependent mechanisms.

Contribution

The study shows that fixed-coefficient accumulation such as EMA captures structural information but cannot retain token identity, which motivates learned, input-dependent selection in sequence models.

Findings

01

EMA traces encode temporal structure effectively.

02

A 130M-parameter language model using only EMA context reaches C4 perplexity 260, about 8x that of GPT-2, so it falls well short of GPT-2 rather than matching it.

03

Lossless information recovery requires learned, input-dependent mechanisms.

Abstract

What exactly do efficient sequence models gain over simple temporal averaging? We use exponential moving average (EMA) traces, the simplest recurrent context (no gating, no content-based retrieval), as a controlled probe to map the boundary between what fixed-coefficient accumulation can and cannot represent. EMA traces encode temporal structure: a Hebbian architecture with multi-timescale traces achieves 96% of a supervised BiGRU on grammatical role assignment with zero labels, surpassing the supervised model on structure-dependent roles. EMA traces destroy token identity: a 130M-parameter language model using only EMA context reaches C4 perplexity 260 (8x GPT-2), and a predictor ablation (replacing the linear predictor with full softmax attention) yields identical loss, localizing the entire gap to the traces. The traces apply lossy, data-independent compression; by the data…
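To make "fixed-coefficient accumulation" concrete, here is a minimal sketch of multi-timescale EMA traces over one-hot token vectors. It assumes the common update h_t = (1 - alpha) * h_(t-1) + alpha * x_t; the paper's exact parameterization is not given in this excerpt, and the function name `ema_traces`, the decay rates, and the toy vocabulary are all illustrative assumptions rather than the author's implementation.

```python
def ema_traces(tokens, vocab_size, alphas):
    """Run one EMA trace per decay rate over a sequence of token ids.

    Assumed update (not confirmed by the source):
        h_t = (1 - alpha) * h_{t-1} + alpha * one_hot(token_t)
    There is no gate and no content-based lookup: the weight a token
    receives depends only on how long ago it appeared.
    """
    traces = [[0.0] * vocab_size for _ in alphas]
    for tok in tokens:
        for trace, alpha in zip(traces, alphas):
            for i in range(vocab_size):
                x_i = 1.0 if i == tok else 0.0
                trace[i] = (1.0 - alpha) * trace[i] + alpha * x_i
    return traces


# Hypothetical toy input: 3-token vocabulary, one fast and one slow trace.
sequence = [0, 1, 2, 1]
fast, slow = ema_traces(sequence, vocab_size=3, alphas=[0.5, 0.1])

print("fast trace:", fast)
print("slow trace:", slow)
```

In this sketch a token seen k steps ago contributes alpha * (1 - alpha)^k to its slot regardless of which token it is, so the mixing coefficients are data-independent. Using several decay rates gives several views of recency, which is one plausible reading of "multi-timescale traces".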
