Learning and Transferring Sparse Contextual Bigrams with Linear Transformers

Yunwei Ren; Zixuan Wang; Jason D. Lee

arXiv:2410.23438·cs.LG·November 1, 2024

TL;DR

This paper introduces the Sparse Contextual Bigram (SCB) model, analyzes the training dynamics of learning it with a one-layer linear transformer, and shows how finetuning from a pretrained model can improve sample efficiency.

Contribution

The paper proposes the SCB model as a natural extension of bigram models, analyzes its training process with linear transformers, and shows how pretraining enhances sample efficiency.

Findings

01 Training from scratch begins with a sample-intensive phase in which the correlation is boosted from zero to a nontrivial value.

02 Finetuning from a pretrained model bypasses this initial phase.

03 The proposed algorithm can outperform SGD in learning SCB.

Abstract

Transformers have excelled in natural language modeling, and one reason behind this success is their exceptional ability to combine contextual information and global knowledge. However, the theoretical basis remains unclear. In this paper, we first introduce the Sparse Contextual Bigram (SCB), a natural extension of the classical bigram model, where the next token's generation depends on a sparse set of earlier positions determined by the last token. We then analyze the training dynamics and sample complexity of learning SCB using a one-layer linear transformer with a gradient-based algorithm. We show that when trained from scratch, the training process can be split into an initial sample-intensive stage where the correlation is boosted from zero to a nontrivial value, followed by a more sample-efficient stage of further improvement. Additionally, we prove that, provided a nontrivial…
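To make the SCB description concrete, the following is a minimal sketch of one way such a data model could be sampled. It is an illustration only, not the paper's exact construction: the names `make_scb` and `sample_next`, the uniform choice among the sparse positions, and the single randomly generated transition matrix are all assumptions, since the abstract does not specify these details.

```python
import random


def make_scb(vocab_size, context_len, sparsity, seed=0):
    """Build a toy SCB instance (hypothetical parameterization).

    positions[k] is the sparse set of earlier positions consulted when the
    last token is k; transition[a] is a distribution over the next token
    given the token a found at the chosen position.
    """
    rng = random.Random(seed)
    # Earlier positions are indices 0 .. context_len - 2 (the last token sits
    # at index context_len - 1), so sparsity must not exceed context_len - 1.
    positions = [
        sorted(rng.sample(range(context_len - 1), sparsity))
        for _ in range(vocab_size)
    ]
    # Random row-stochastic bigram matrix (assumption: any such matrix works
    # for illustration).
    transition = []
    for _ in range(vocab_size):
        weights = [rng.random() for _ in range(vocab_size)]
        total = sum(weights)
        transition.append([w / total for w in weights])
    return positions, transition


def sample_next(context, positions, transition, rng):
    """Sample the next token for a context whose final entry is the last token."""
    last = context[-1]
    # Contextual part: the last token selects a sparse set of earlier positions.
    pos = rng.choice(positions[last])  # assumption: uniform over the sparse set
    source = context[pos]
    # Bigram part: the token at that position drives the next-token distribution.
    return rng.choices(range(len(transition)), weights=transition[source], k=1)[0]


if __name__ == "__main__":
    positions, transition = make_scb(vocab_size=5, context_len=8, sparsity=2, seed=1)
    rng = random.Random(0)
    context = [rng.randrange(5) for _ in range(8)]
    print(context, "->", sample_next(context, positions, transition, rng))
```

With `sparsity=1` and the position set always pointing at the token just before the last one, this reduces to something close to a classical bigram model, which is the sense in which SCB extends it.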

Videos

Learning and Transferring Sparse Contextual Bigrams with Linear Transformers · SlidesLive

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Indoor and Outdoor Localization Technologies

Methods: Sparse Evolutionary Training · Stochastic Gradient Descent