EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context
Arth Singh

TL;DR
This paper uses exponential moving average (EMA) traces, a minimal form of recurrent context, to probe what fixed-coefficient accumulation can and cannot represent. The traces encode temporal structure well but fail to preserve token identity, which points to the need for learned, input-dependent mechanisms.
Contribution
The study demonstrates that fixed-coefficient accumulation like EMA captures structural information but cannot retain token identity, emphasizing the need for learned, input-dependent selection in sequence models.
Findings
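EMA traces encode temporal structure: a Hebbian architecture with multi-timescale traces reaches 96% of a supervised BiGRU on grammatical role assignment without any labels.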
A 130M-parameter language model using only EMA context reaches C4 perplexity 260, about 8x that of GPT-2, so it falls far short of GPT-2 rather than matching it.
Lossless information recovery requires learned, input-dependent mechanisms.
Abstract
What exactly do efficient sequence models gain over simple temporal averaging? We use exponential moving average (EMA) traces, the simplest recurrent context (no gating, no content-based retrieval), as a controlled probe to map the boundary between what fixed-coefficient accumulation can and cannot represent. EMA traces encode temporal structure: a Hebbian architecture with multi-timescale traces achieves 96% of a supervised BiGRU on grammatical role assignment with zero labels, surpassing the supervised model on structure-dependent roles. EMA traces destroy token identity: a 130M-parameter language model using only EMA context reaches C4 perplexity 260 (8x GPT-2), and a predictor ablation (replacing the linear predictor with full softmax attention) yields identical loss, localizing the entire gap to the traces. The traces apply lossy, data-independent compression; by the data…
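To make "fixed-coefficient accumulation" concrete, here is a minimal Python sketch of multi-timescale EMA traces over one-hot token vectors. It assumes the common update rule h_t = a * h_{t-1} + (1 - a) * x_t with a fixed decay a per trace; the paper's exact parameterization is not given in the abstract, and the names `ema_traces`, `one_hot`, and the decay values 0.5 and 0.9 are illustrative choices, not the author's.

```python
from typing import List, Sequence


def one_hot(index: int, size: int) -> List[float]:
    """Encode a token id as a one-hot vector (illustrative input format)."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def ema_traces(
    inputs: Sequence[Sequence[float]], decays: Sequence[float]
) -> List[List[List[float]]]:
    """Return history[t][k]: the trace with decay decays[k] after step t.

    Assumed update (not confirmed by the source):
        h_t = a * h_{t-1} + (1 - a) * x_t
    The coefficient a is fixed, so the mixing weights never depend on
    the content of x_t (no gating, no content-based retrieval).
    """
    if not inputs:
        return []
    dim = len(inputs[0])
    state = [[0.0] * dim for _ in decays]
    history = []
    for x in inputs:
        for k, a in enumerate(decays):
            state[k] = [a * s + (1.0 - a) * xi for s, xi in zip(state[k], x)]
        history.append([list(s) for s in state])
    return history


tokens = [0, 1, 2]  # a toy sequence over a 3-token vocabulary
inputs = [one_hot(t, 3) for t in tokens]
history = ema_traces(inputs, decays=[0.5, 0.9])

final_fast, final_slow = history[-1]
print(final_fast)  # [0.125, 0.25, 0.5]
print(final_slow)
```

In the fast trace, the weight on each token depends only on how long ago it occurred (0.5 for the latest token, 0.25 for the one before, and so on). The trace therefore carries ordering and recency information, while earlier tokens are blended into a single vector with geometrically shrinking weight. That is the sense in which the compression is data-independent under this assumed update.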
