How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
Ruiquan Huang, Yingbin Liang, Jing Yang

TL;DR
This paper provides a theoretical analysis of how one-layer transformers learn regular language recognition tasks, revealing training phases, implicit biases, and the role of Chain-of-Thought in solving parity checks.
Contribution
It offers a novel theoretical framework analyzing training dynamics and implicit bias of transformers on regular language tasks, including the integration of Chain-of-Thought.
Findings
Attention layer rapidly separates data in early training
Linear layer grows logarithmically in norm while its direction approaches a max-margin hyperplane
Training loss decreases at a rate of O(1/t)
Abstract
Language recognition tasks are fundamental in natural language processing (NLP) and have been widely used to benchmark the performance of large language models (LLMs). These tasks also play a crucial role in explaining the working mechanisms of transformers. In this work, we focus on two representative tasks in the category of regular language recognition, known as 'even pairs' and 'parity check', the aim of which is to determine whether the occurrences of certain subsequences in a given sequence are even. Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks by theoretically analyzing its training dynamics under gradient descent. While even pairs can be solved directly by a one-layer transformer, parity check needs to be solved by integrating Chain-of-Thought (CoT), either into the inference stage of…
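To make the two tasks concrete, here is a minimal Python sketch of their labeling rules. The abstract only says that each task checks whether certain subsequences occur an even number of times, so the specifics below are assumptions: a binary alphabet {a, b}, 'even pairs' counting adjacent "ab" and "ba" pairs, and 'parity check' counting occurrences of "b". The function names are hypothetical, and `parity_with_cot` only illustrates the general idea of emitting intermediate steps, not the paper's construction.

```python
def even_pairs(seq: str) -> bool:
    """Assumed rule: True if the total number of adjacent 'ab' and 'ba' pairs is even."""
    # Over a two-letter alphabet, every adjacent pair of differing symbols is 'ab' or 'ba'.
    switches = sum(1 for x, y in zip(seq, seq[1:]) if x != y)
    return switches % 2 == 0


def parity_check(seq: str) -> bool:
    """Assumed rule: True if the symbol 'b' occurs an even number of times."""
    return seq.count("b") % 2 == 0


def parity_with_cot(seq: str) -> list[bool]:
    """Illustrative chain-of-thought: emit the running parity after each token.

    The last entry is the final answer; earlier entries are intermediate steps.
    """
    steps = []
    is_even = True  # zero occurrences of 'b' so far
    for ch in seq:
        if ch == "b":
            is_even = not is_even
        steps.append(is_even)
    return steps


if __name__ == "__main__":
    for s in ["ab", "abba", "abb"]:
        print(s, even_pairs(s), parity_check(s), parity_with_cot(s))
```

Under these assumptions, `even_pairs` depends only on whether the first and last symbols match, since each switch between "a" and "b" flips the current symbol. `parity_check` depends on every token in the sequence, which is consistent with the abstract's statement that it is the task requiring step-by-step reasoning.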
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
Methods: Softmax · Attention Is All You Need · Linear Layer · Focus
