Characterizing Verbatim Short-Term Memory in Neural Language Models
Kristijan Armeni, Christopher Honey, Tal Linzen

TL;DR
This paper investigates how transformer and LSTM language models retrieve prior context. Transformers recall the exact words and their order, while the LSTM shows more limited, coarser retrieval.
Contribution
The study shows that transformers behave like a flexible working-memory system capable of precise token retrieval, whereas the LSTM maintains a coarser semantic gist of prior context.
Findings
Transformers retrieve both the identity and the order of nouns from prior context.
The LSTM shows limited, less precise retrieval weighted toward early tokens.
Transformer retrieval improves with a larger training corpus and greater model depth.
Abstract
When a language model is trained to predict natural language sequences, its prediction at each moment depends on a representation of prior context. What kind of information about the prior context can language models retrieve? We tested whether language models could retrieve the exact words that occurred previously in a text. In our paradigm, language models (transformers and an LSTM) processed English text in which a list of nouns occurred twice. We operationalized retrieval as the reduction in surprisal from the first to the second list. We found that the transformers retrieved both the identity and ordering of nouns from the first list. Further, the transformers' retrieval was markedly enhanced when they were trained on a larger corpus and with greater model depth. Lastly, their ability to index prior tokens was dependent on learned attention patterns. In contrast, the LSTM exhibited…
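The retrieval measure described in the abstract (reduction in surprisal from the first to the second list) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of the mean as the summary statistic, and the toy probabilities are all assumptions made here. In practice the per-token probabilities would come from a language model's predictions on the noun tokens of each list.

```python
import math
from statistics import mean


def surprisal(prob: float) -> float:
    """Surprisal in bits of a token that the model assigned probability `prob`."""
    return -math.log2(prob)


def surprisal_reduction(first_probs, second_probs):
    """Compare average surprisal on the first and second presentation of a list.

    Returns (absolute_reduction_in_bits, second_as_percent_of_first).
    Averaging with the mean is an assumption of this sketch.
    """
    s_first = mean(surprisal(p) for p in first_probs)
    s_second = mean(surprisal(p) for p in second_probs)
    return s_first - s_second, 100.0 * s_second / s_first


# Hypothetical probabilities for two nouns, on first and second presentation.
first_list_probs = [0.25, 0.125]   # surprisals: 2 bits and 3 bits
second_list_probs = [0.5, 0.5]     # surprisals: 1 bit each

reduction_bits, percent_of_first = surprisal_reduction(
    first_list_probs, second_list_probs
)
print(reduction_bits, percent_of_first)
```

A large reduction (or a low percentage) indicates that the model predicted the repeated nouns much more confidently the second time, which is what the paradigm treats as evidence of retrieval.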
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
