Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models

Yifan Hao; Chenlu Ye; Chi Han; Tong Zhang

arXiv:2506.01919 · cs.LG · June 3, 2025

Open Access · 3 Reviews

TL;DR

This paper explores how Transformers process sequences by analyzing their layerwise behavior, revealing that lower layers extract features while upper layers achieve feature decoupling, which explains their multi-task learning capabilities.

Contribution

It provides the first detailed layerwise analysis of Transformers on Hidden Markov Models, combining empirical observations with theoretical insights into their expressiveness.

Findings

01. Lower layers focus on neighboring token features

02. Upper layers exhibit high time disentanglement

03. Theoretical analysis supports empirical observations

Abstract

Transformer-based models have shown remarkable capabilities in sequence learning across a wide range of tasks, often performing well on a specific task by leveraging input-output examples. Despite their empirical success, a comprehensive theoretical understanding of this phenomenon remains limited. In this work, we investigate the layerwise behavior of Transformers to uncover the mechanisms underlying their multi-task generalization ability. Exploring a typical class of sequence models, i.e., Hidden Markov Models, which are fundamental to many language tasks, we observe that: first, lower layers of Transformers focus on extracting feature representations, primarily influenced by neighboring tokens; second, in the upper layers, features become decoupled, exhibiting a high degree of time disentanglement. Building on these empirical insights, we provide theoretical analysis for the…
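
For readers unfamiliar with the setting, a Hidden Markov Model emits each observed token from an unobserved state that evolves as a Markov chain. The snippet below is a minimal sketch of sampling one such sequence. It is not the paper's dataset construction: the function name `sample_hmm`, the two-state, three-symbol matrices, and the seed are all hypothetical values chosen for illustration.

```python
import random


def sample_hmm(transition, emission, initial, length, seed=0):
    """Sample (hidden_states, observations) from a discrete HMM.

    transition[i][j]: probability of moving from hidden state i to j.
    emission[i][k]:   probability that hidden state i emits symbol k.
    initial[i]:       probability that the chain starts in state i.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    n_states = len(transition)
    n_symbols = len(emission[0])

    states, observations = [], []
    state = rng.choices(range(n_states), weights=initial)[0]
    for _ in range(length):
        states.append(state)
        # The observation depends only on the current hidden state.
        symbol = rng.choices(range(n_symbols), weights=emission[state])[0]
        observations.append(symbol)
        # The next hidden state depends only on the current hidden state.
        state = rng.choices(range(n_states), weights=transition[state])[0]
    return states, observations


# Hypothetical parameters (assumptions for illustration, not from the paper).
TRANSITION = [[0.9, 0.1], [0.2, 0.8]]
EMISSION = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
INITIAL = [0.5, 0.5]

states, observations = sample_hmm(TRANSITION, EMISSION, INITIAL, length=10)
print("hidden states:", states)
print("observations: ", observations)
```

A sequence model trained on such data sees only `observations`; the hidden states are what it must implicitly track in order to predict the next token.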

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper is an interesting extension of previous works on the ability of transformers to learn Markov data to the more general HMMs. Filling this gap is of theoretical interest, especially due to the proof tools employed, leveraging previous literature on HMMs and ML architectures, and some of the theoretical assumptions used to simplify the theoretical analysis, such as the low-rank structure of the Markov transition kernel.

Weaknesses

My main concern about the paper is the lack of clarity and detail in the experiments in Section 2. I think this section needs extensive rewriting. The dataset construction described in Section 2.1 is very badly explained and not clear at all. The description provided in Appendix B.1 is a bit better. Figures 1-2-3 are not described in sufficient detail and are very confusing and hard to understand. In particular, it is not clear how Figure 3 is generated. The Related Work section is also way to

Reviewer 02 · Rating 2 · Confidence 2

Strengths

1. The paper attempts to address an extremely interesting and important problem.
2. The authors aim to understand this phenomenon from both theoretical and practical perspectives.

Weaknesses

1. I spent a considerable amount of time on this paper, but still found it difficult to parse and follow. A thorough rewrite could significantly improve its clarity and readability.
2. Several figures are also difficult to interpret and would benefit from clearer labeling or improved presentation.
3. Additionally, some relevant references appear to be missing and should be included to provide appropriate context and attribution.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

Using a structured synthetic setup like Hidden Markov Models to study how transformers learn the task structure is a reasonable and well-motivated approach. The approach of pairing empirical observations about layer-wise behavior with a theoretical characterization of how transformers can approximate HMM distributions has the potential to provide complementary insights.

Weaknesses

The overall presentation lacks clarity and organization, which makes it difficult to follow the results, and the paper would benefit from a thorough revision. The organization is confusing. For instance, many details about the experiments in Section 2 are either missing or only defined later (such as the data format introduced in Section 3), and even there, not clearly. The probing experiment starting around line 126 is not clearly described and the discussions are not concrete enough. The metri

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis