Learning and Transferring Sparse Contextual Bigrams with Linear Transformers
Yunwei Ren, Zixuan Wang, Jason D. Lee

TL;DR
This paper introduces the Sparse Contextual Bigram (SCB) model, analyzes its training dynamics with linear transformers, and demonstrates how pretraining and finetuning can improve learning efficiency in language models.
Contribution
The paper proposes the SCB model as a natural extension of bigram models, analyzes its training process with linear transformers, and shows how pretraining enhances sample efficiency.
Findings
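Training from scratch begins with a sample-intensive phase in which the correlation is boosted from zero to a nontrivial value, followed by a more sample-efficient phase.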
Finetuning from pretrained models bypasses the initial phase.
The proposed algorithm can outperform SGD in learning SCB.
Abstract
Transformers have excelled in natural language modeling, and one reason behind this success is their exceptional ability to combine contextual information and global knowledge. However, the theoretical basis remains unclear. In this paper, we first introduce the Sparse Contextual Bigram (SCB), a natural extension of the classical bigram model, where the next token's generation depends on a sparse set of earlier positions determined by the last token. We then analyze the training dynamics and sample complexity of learning SCB using a one-layer linear transformer with a gradient-based algorithm. We show that when trained from scratch, the training process can be split into an initial sample-intensive stage where the correlation is boosted from zero to a nontrivial value, followed by a more sample-efficient stage of further improvement. Additionally, we prove that, provided a nontrivial…
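To make the SCB description above concrete, here is a minimal Python sketch of one way such a generative process could look. It is an illustration under stated assumptions, not the paper's exact definition: the names make_scb and sample_next, the uniform choice among the sparse positions, and the randomly generated transition table are all hypothetical choices made for this example.

```python
import random


def make_scb(vocab_size, context_len, sparsity, seed=0):
    """Build a toy SCB instance (hypothetical parameterization)."""
    rng = random.Random(seed)
    # positions[k]: the sparse set of earlier positions used when the last token is k.
    positions = [
        sorted(rng.sample(range(context_len), sparsity))
        for _ in range(vocab_size)
    ]
    # transition[j]: next-token distribution given the selected token j (rows sum to 1).
    transition = []
    for _ in range(vocab_size):
        weights = [rng.random() for _ in range(vocab_size)]
        total = sum(weights)
        transition.append([w / total for w in weights])
    return positions, transition


def sample_next(context, last_token, positions, transition, rng):
    """Sample the next token given the context and the last token."""
    # Contextual part: the last token decides which earlier positions matter.
    # Assumption: one of those positions is picked uniformly at random.
    pos = rng.choice(positions[last_token])
    selected = context[pos]
    # Bigram part: the next token follows the transition row of the selected token.
    vocab = range(len(transition))
    next_token = rng.choices(vocab, weights=transition[selected], k=1)[0]
    return pos, next_token


rng = random.Random(1)
positions, transition = make_scb(vocab_size=5, context_len=8, sparsity=2)
context = [rng.randrange(5) for _ in range(8)]
last_token = rng.randrange(5)
pos, next_token = sample_next(context, last_token, positions, transition, rng)
```

In this sketch, setting sparsity to 1 and always pointing at the same position would recover an ordinary bigram-style dependence on a single token, which is the sense in which SCB extends the classical bigram model.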
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Indoor and Outdoor Localization Technologies
Methods: Sparse Evolutionary Training · Stochastic Gradient Descent
