A Meta-Learning Perspective on Transformers for Causal Language Modeling
Xinbo Wu, Lav R. Varshney

TL;DR
This paper presents a meta-learning perspective on Transformer models used for causal language modeling, revealing inner optimization dynamics and analyzing token representation norms to better understand their capabilities.
Contribution
It introduces a novel meta-learning framework for Transformers in causal language modeling and provides theoretical analysis of token representation norms.
Findings
Inner optimization process in Transformers explained
Theoretical analysis of token representation norms
Experimental validation across various settings
Abstract
The Transformer architecture has become prominent in developing large causal language models. However, mechanisms to explain its capabilities are not well understood. Focused on the training process, here we establish a meta-learning view of the Transformer architecture when trained for the causal language modeling task, by explicating an inner optimization process within the Transformer. Further, within the inner optimization, we discover and theoretically analyze a special characteristic of the norms of learned token representations within Transformer-based causal language models. Our analysis is supported by experiments in various settings.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection
