How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias

Ruiquan Huang; Yingbin Liang; Jing Yang

arXiv:2505.00926 · cs.LG · May 30, 2025

TL;DR

This paper provides a theoretical analysis of how one-layer transformers learn regular language recognition tasks, revealing training phases, implicit biases, and the role of Chain-of-Thought in solving parity checks.

Contribution

It offers a novel theoretical framework analyzing training dynamics and implicit bias of transformers on regular language tasks, including the integration of Chain-of-Thought.

Findings

01 Attention layer rapidly separates data in early training

02 Linear layer approaches max-margin hyperplane logarithmically

03 Training loss decreases at a rate of O(1/t)
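
To give a feel for findings 02 and 03, here is a minimal numerical sketch. It is not the paper's one-layer transformer: it assumes a single one-dimensional training example (x = 1, y = +1) with logistic loss, trained by plain gradient descent. The function name `run_gd` and the learning rate are illustrative choices. In this toy setting the weight grows roughly like log t while the loss shrinks roughly like 1/t, which is the same qualitative pattern the findings describe for the linear layer.

```python
import math

def run_gd(steps, lr=1.0):
    """Gradient descent on the logistic loss log(1 + exp(-w)).

    Toy assumption: one training example x = 1 with label y = +1,
    so the data are trivially separable and w grows without bound.
    """
    w = 0.0
    history = []
    for t in range(1, steps + 1):
        grad = -1.0 / (1.0 + math.exp(w))  # derivative of log(1 + exp(-w))
        w -= lr * grad
        loss = math.log1p(math.exp(-w))
        history.append((t, w, loss))
    return history

if __name__ == "__main__":
    for t, w, loss in run_gd(10_000):
        if t in (10, 100, 1_000, 10_000):
            # Compare w with log(t), and t * loss with a constant.
            print(t, round(w, 3), round(math.log(t), 3), round(t * loss, 3))
```

Printing `w` next to `log(t)` and `t * loss` makes the two rates easy to compare by eye; how closely they track is specific to this toy problem and says nothing about the constants in the paper's bounds.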

Abstract

Language recognition tasks are fundamental in natural language processing (NLP) and have been widely used to benchmark the performance of large language models (LLMs). These tasks also play a crucial role in explaining the working mechanisms of transformers. In this work, we focus on two representative tasks in the category of regular language recognition, known as 'even pairs' and 'parity check', the aim of which is to determine whether the occurrences of certain subsequences in a given sequence are even. Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks by theoretically analyzing its training dynamics under gradient descent. While even pairs can be solved directly by a one-layer transformer, parity check needs to be solved by integrating Chain-of-Thought (CoT), either into the inference stage of…
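
The two tasks named in the abstract can be made concrete with a short sketch. The abstract does not spell out the definitions, so the code assumes the usual formulation over the alphabet {a, b}: 'even pairs' asks whether the total number of "ab" and "ba" substrings is even, and 'parity check' asks whether the number of "b" tokens is even. All function names are illustrative, and `parity_steps` is only one way to picture a step-by-step decomposition, not the paper's CoT construction.

```python
def count_occurrences(s: str, pattern: str) -> int:
    """Count (possibly overlapping) occurrences of pattern in s."""
    n = len(pattern)
    return sum(1 for i in range(len(s) - n + 1) if s[i:i + n] == pattern)

def even_pairs(s: str) -> bool:
    """Assumed definition: total count of 'ab' and 'ba' substrings is even."""
    return (count_occurrences(s, "ab") + count_occurrences(s, "ba")) % 2 == 0

def even_pairs_shortcut(s: str) -> bool:
    """Equivalent test for non-empty strings over {a, b}:
    the number of a/b switches is even iff first and last tokens match."""
    return s[0] == s[-1]

def parity_check(s: str) -> bool:
    """Assumed definition: the number of 'b' tokens is even."""
    return s.count("b") % 2 == 0

def parity_steps(s: str) -> list[int]:
    """Running parity of 'b' after each token (illustrative intermediate steps)."""
    parity, steps = 0, []
    for ch in s:
        parity ^= (ch == "b")
        steps.append(parity)
    return steps

if __name__ == "__main__":
    for s in ["ab", "abba", "abb", "bbbb"]:
        print(s, even_pairs(s), parity_check(s), parity_steps(s))
```

The shortcut shows why even pairs depends only on the first and last tokens, whereas parity check depends on every token in the sequence; the running-parity list breaks that dependence into single-token updates.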

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning

Methods: Softmax · Attention Is All You Need · Linear Layer · Focus