Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity
Pranava Madhyastha, Dagmar Adamcova

TL;DR
This paper explores integrating human-like working memory constraints into Transformer models, demonstrating improved grammatical accuracy and alignment with human data in low-data scenarios.
Contribution
It introduces cognitively inspired attention mechanisms into GPT-2, showing their effectiveness in data-scarce linguistic tasks.
Findings
Fixed-width attention improves grammatical accuracy when training data is limited.
Constrained models align better with human reading time data.
Memory constraints serve as beneficial inductive biases.
Abstract
We investigate the integration of human-like working memory constraints into the Transformer architecture and implement several cognitively inspired attention variants, including fixed-width window-based and temporal decay-based attention mechanisms. Our modified GPT-2 models are trained from scratch on developmentally plausible datasets (10M and 100M words). Performance is evaluated on grammatical judgment tasks (BLiMP) and alignment with human reading time data. Our results indicate that these cognitively inspired constraints, particularly fixed-width attention, can significantly improve grammatical accuracy, especially when training data is scarce. These constrained models also tend to show a stronger alignment with human processing metrics. The findings suggest that such constraints may serve as a beneficial inductive bias, guiding models towards more robust linguistic…
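The abstract names two attention variants without giving their exact formulation. The sketch below shows one plausible reading of each in plain NumPy: a causal fixed-width window that lets each token attend only to the most recent `window` positions, and a temporal decay that subtracts a penalty proportional to token distance from the attention logits. The function names, the linear-in-distance decay, and the `window` and `decay_rate` values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def fixed_window_mask(seq_len, window):
    """Causal mask: position i may attend to j only if 0 <= i - j < window.

    Assumed form of a fixed-width attention window (hypothetical helper).
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    dist = i - j
    return (dist >= 0) & (dist < window)


def temporal_decay_bias(seq_len, decay_rate):
    """Additive logit bias: -decay_rate * (i - j) for past tokens, -inf for future.

    Assumed linear-in-distance decay (hypothetical helper); the bias also
    enforces causality.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    dist = (i - j).astype(float)
    return np.where(dist >= 0, -decay_rate * dist, -np.inf)


def constrained_attention(q, k, v, mask=None, bias=None):
    """Single-head scaled dot-product attention with optional mask and bias.

    Causality comes only from the mask or bias that is passed in.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if bias is not None:
        scores = scores + bias
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
q = rng.normal(size=(seq_len, d_model))
k = rng.normal(size=(seq_len, d_model))
v = rng.normal(size=(seq_len, d_model))

# Fixed-width variant: each token sees itself and one previous token.
window_mask = fixed_window_mask(seq_len, window=2)
out_window, w_window = constrained_attention(q, k, v, mask=window_mask)

# Temporal-decay variant: older tokens are down-weighted smoothly.
decay_bias = temporal_decay_bias(seq_len, decay_rate=0.5)
out_decay, w_decay = constrained_attention(q, k, v, bias=decay_bias)
```

In this sketch the window imposes a hard memory limit, while the decay keeps all past tokens reachable but progressively less influential. Either could be applied inside each GPT-2 attention head by combining the mask or bias with the usual causal mask.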
