Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity

Pranava Madhyastha; Dagmar Adamcova

arXiv:2604.20789·cs.CL·April 24, 2026

TL;DR

This paper integrates human-like working memory constraints into Transformer models and reports improved grammatical accuracy and closer alignment with human reading time data when training data is scarce.

Contribution

The paper implements cognitively inspired attention variants (fixed-width windows and temporal decay) in GPT-2 models trained from scratch, and shows that they help on grammatical judgment tasks when data is scarce.

Findings

01

Fixed-width attention improves grammatical accuracy when training data is limited.

02

Constrained models align better with human reading time data.

03

Memory constraints may serve as a beneficial inductive bias.

Abstract

We investigate the integration of human-like working memory constraints into the Transformer architecture and implement several cognitively inspired attention variants, including fixed-width window-based and temporal decay-based attention mechanisms. Our modified GPT-2 models are trained from scratch on developmentally plausible datasets (10M and 100M words). Performance is evaluated on grammatical judgment tasks (BLiMP) and on alignment with human reading time data. Our results indicate that these cognitively inspired constraints, particularly fixed-width attention, can significantly improve grammatical accuracy, especially when training data is scarce. These constrained models also tend to show a stronger alignment with human processing metrics. The findings suggest that such constraints may serve as a beneficial inductive bias, guiding models towards more robust linguistic…
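The two attention variants named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `window` and `rate` parameters, and the exponential form of the decay (a linear penalty on distance in log space) are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def fixed_window_mask(seq_len, window):
    """Causal mask: position i may attend to j only if i - window < j <= i.
    `window` is a hypothetical parameter, not a value from the paper."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def temporal_decay_bias(seq_len, rate):
    """Additive bias -rate * (i - j) for past positions, -inf for future ones.
    After the softmax this scales weights by exp(-rate * distance).
    The exponential form is an assumption for this sketch."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    dist = (i - j).astype(float)
    return np.where(j <= i, -rate * dist, -np.inf)

def constrained_attention(scores, mask=None, bias=None):
    """Row-wise softmax over raw attention scores after applying
    an optional additive bias and an optional boolean mask."""
    s = scores.astype(float).copy()
    if bias is not None:
        s = s + bias
    if mask is not None:
        s = np.where(mask, s, -np.inf)
    # The diagonal is always allowed, so each row has a finite maximum.
    s = s - s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    return w / w.sum(axis=-1, keepdims=True)

# Usage: uniform raw scores over a 5-token sequence.
scores = np.zeros((5, 5))
windowed = constrained_attention(scores, mask=fixed_window_mask(5, window=2))
decayed = constrained_attention(scores, bias=temporal_decay_bias(5, rate=1.0))
```

With uniform raw scores, the windowed variant spreads each token's attention evenly over at most the last `window` positions, while the decay variant keeps every past position reachable but weights recent ones more heavily.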
