Gradient Localization Improves Lifelong Pretraining of Language Models
Jared Fernandez, Yonatan Bisk, Emma Strubell

TL;DR
This paper investigates how different types of knowledge are stored in language models and shows that restricting parameter updates to specific layers improves continual pretraining, especially for temporally sensitive information.
Contribution
It shows that knowledge in language models is localized to particular sets of parameters and proposes targeted parameter updates to improve lifelong learning of temporal information.
Findings
Knowledge about temporally sensitive entities is localized to specific subsets of model parameters.
Targeted updates to relevant layers improve continual pretraining performance.
Targeting updates to the layers with larger gradient norms helps the model adapt to temporal drift.
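The idea behind these findings can be sketched in a few lines: compute a gradient norm per layer, pick the layers where it is largest, and update only those. This is a minimal illustration, not the authors' implementation. The top-k selection rule, the plain SGD step, the toy weights and gradients, and every function name below are assumptions made for the example; this summary does not state how the paper selects layers or which optimizer it uses.

```python
import math
from typing import Dict, List

# A "model" here is just a mapping from layer name to a flat list of weights.
Params = Dict[str, List[float]]


def layer_grad_norms(grads: Params) -> Dict[str, float]:
    """L2 norm of the gradient for each named layer."""
    return {name: math.sqrt(sum(g * g for g in vals)) for name, vals in grads.items()}


def select_top_layers(norms: Dict[str, float], k: int) -> List[str]:
    """Names of the k layers with the largest gradient norms (assumed selection rule)."""
    return sorted(norms, key=norms.get, reverse=True)[:k]


def targeted_sgd_step(params: Params, grads: Params, selected: List[str], lr: float) -> Params:
    """Apply a plain SGD step to the selected layers only; other layers stay frozen."""
    chosen = set(selected)
    return {
        name: [w - lr * g for w, g in zip(vals, grads[name])] if name in chosen else list(vals)
        for name, vals in params.items()
    }


# Toy data: three hypothetical layers with hand-written weights and gradients.
params = {"layer0": [1.0, 1.0], "layer1": [1.0, 1.0], "layer2": [1.0, 1.0]}
grads = {"layer0": [0.1, 0.0], "layer1": [3.0, 4.0], "layer2": [0.0, 0.5]}

norms = layer_grad_norms(grads)
selected = select_top_layers(norms, k=1)
updated = targeted_sgd_step(params, grads, selected, lr=0.1)
```

With these toy values only `layer1` has a large gradient norm, so it is the only layer whose weights change; `layer0` and `layer2` are left untouched, which is the mechanism the findings credit with reducing forgetting of previously learned information.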
Abstract
Large Language Models (LLMs) trained on web-scale text corpora have been shown to capture world knowledge in their parameters. However, the mechanism by which language models store different types of knowledge is poorly understood. In this work, we examine two types of knowledge relating to temporally sensitive entities and demonstrate that each type is localized to different sets of parameters within the LLMs. We hypothesize that the lack of consideration of the locality of knowledge in existing continual learning methods contributes to both the failed uptake of new information and the catastrophic forgetting of previously learned information. We observe that sequences containing references to updated and newly mentioned entities exhibit larger gradient norms in a subset of layers. We demonstrate that targeting parameter updates to these relevant layers can improve the performance of…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
