Enhancing elusive clues in knowledge learning by contrasting attention of language models
Jian Gao, Xiao Zhang, Ji Wu, Miao Li

TL;DR
This paper introduces a method to improve knowledge learning in language models by contrasting attention patterns of different-sized models to identify and emphasize elusive clues, leading to better memorization and learning efficiency.
Contribution
The paper proposes a novel approach that leverages attention contrast between large and small models to enhance learning from subtle clues in training data.
Findings
Larger models focus more on non-obvious clues.
Contrasting attention helps identify important but overlooked clues.
Token-dropout guided by clues improves memorization performance.
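The findings above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say how the contrast is computed or how it drives token dropout, so the sketch assumes a plain difference of attention weights and a linear mapping from contrast to drop probability. The function names, the base_rate parameter, and the example numbers are all hypothetical.

```python
import numpy as np


def contrast_attention(attn_large, attn_small):
    """Per-token contrast: large-model attention minus small-model attention.

    Assumption: both inputs are 1-D attention weights over the same context
    tokens (e.g. averaged over heads). The subtraction is an illustrative
    choice, not necessarily the paper's formula.
    """
    attn_large = np.asarray(attn_large, dtype=float)
    attn_small = np.asarray(attn_small, dtype=float)
    if attn_large.shape != attn_small.shape:
        raise ValueError("attention vectors must cover the same tokens")
    return attn_large - attn_small


def dropout_probs(contrast, base_rate=0.3):
    """Map contrast scores to per-token drop probabilities.

    Assumption: the highest-contrast token (the strongest "elusive clue") is
    never dropped, the lowest-contrast token is dropped with probability
    base_rate, and everything in between is interpolated linearly.
    """
    contrast = np.asarray(contrast, dtype=float)
    lo, hi = contrast.min(), contrast.max()
    if hi == lo:
        # No contrast signal: fall back to uniform dropout.
        return np.full(contrast.shape, base_rate)
    norm = (contrast - lo) / (hi - lo)
    return base_rate * (1.0 - norm)


def clue_guided_dropout(tokens, contrast, base_rate=0.3, rng=None):
    """Return a copy of tokens with low-contrast tokens randomly removed."""
    rng = np.random.default_rng() if rng is None else rng
    p_drop = dropout_probs(contrast, base_rate)
    keep = rng.random(len(tokens)) >= p_drop
    return [tok for tok, k in zip(tokens, keep) if k]


if __name__ == "__main__":
    # Hypothetical attention weights over four context tokens.
    tokens = ["t0", "t1", "t2", "t3"]
    contrast = contrast_attention([0.10, 0.40, 0.10, 0.40],
                                  [0.10, 0.10, 0.40, 0.40])
    print(clue_guided_dropout(tokens, contrast, rng=np.random.default_rng(0)))
```

Under these assumptions, each training pass sees a different dropped-out variant of the text, while the tokens that the larger model attends to more than the smaller one are preserved most often.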
Abstract
Causal language models acquire a vast amount of knowledge from general text corpora during pretraining, but the efficiency of knowledge learning is known to be unsatisfactory, especially when learning from knowledge-dense, small-sized corpora. The deficiency can come from long-distance dependencies, which are hard for language models to capture, and from overfitting to co-occurrence patterns and distracting clues in the training text. To address these issues, the paper proposes a method to enhance knowledge learning during language model pretraining by enhancing elusive but important clues in text discovered by the language models themselves. We found that larger language models pay more attention to non-obvious but important clues, which are often overlooked by smaller language models. Therefore, we can identify these clues by contrasting the attention weights of large and small language…
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Innovative Teaching and Learning Methods
Methods: Softmax · Attention Is All You Need
