Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models
Yifan Hao, Chenlu Ye, Chi Han, Tong Zhang

TL;DR
This paper explores how Transformers process sequences by analyzing their layerwise behavior, revealing that lower layers extract features while upper layers achieve feature decoupling, which explains their multi-task learning capabilities.
Contribution
It provides the first detailed layerwise analysis of Transformers on Hidden Markov Models, combining empirical observations with theoretical insights into their expressiveness.
Findings
Lower layers focus on neighboring token features
Upper layers exhibit high time disentanglement
Theoretical analysis supports empirical observations
Abstract
Transformer-based models have shown remarkable capabilities in sequence learning across a wide range of tasks, often performing well on a specific task by leveraging input-output examples. Despite their empirical success, a comprehensive theoretical understanding of this phenomenon remains limited. In this work, we investigate the layerwise behavior of Transformers to uncover the mechanisms underlying their multi-task generalization ability. Studying a typical sequence model, i.e., Hidden Markov Models, which are fundamental to many language tasks, we observe that: first, lower layers of Transformers focus on extracting feature representations, primarily influenced by neighboring tokens; second, in the upper layers, features become decoupled, exhibiting a high degree of time disentanglement. Building on these empirical insights, we provide theoretical analysis for the…
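For readers unfamiliar with the setting, the following is a minimal sketch of a generic Hidden Markov Model, assuming a small discrete state and token space. It is not the paper's dataset construction: the parameters `INIT`, `TRANS`, and `EMIT` and both function names are hypothetical, chosen only for illustration. The second function is the standard forward-filtering recursion, which gives the exact next-token distribution a sequence model would need to approximate on HMM data.

```python
import random

def sample_hmm(init, trans, emit, length, rng):
    """Sample (hidden_states, observations) of the given length from an HMM."""
    states, obs = [], []
    state = rng.choices(range(len(init)), weights=init)[0]
    for _ in range(length):
        states.append(state)
        # Emit a token from the current hidden state.
        obs.append(rng.choices(range(len(emit[state])), weights=emit[state])[0])
        # Move to the next hidden state.
        state = rng.choices(range(len(trans[state])), weights=trans[state])[0]
    return states, obs

def next_token_distribution(init, trans, emit, observations):
    """Forward filtering: P(next observation | observations so far)."""
    n_states = len(init)
    belief = list(init)  # distribution over the current hidden state
    for o in observations:
        # Condition on the observed token, then normalize.
        belief = [belief[s] * emit[s][o] for s in range(n_states)]
        total = sum(belief)
        belief = [b / total for b in belief]
        # Propagate one step through the transition matrix.
        belief = [sum(belief[s] * trans[s][t] for s in range(n_states))
                  for t in range(n_states)]
    n_tokens = len(emit[0])
    return [sum(belief[s] * emit[s][v] for s in range(n_states))
            for v in range(n_tokens)]

# Hypothetical toy parameters (2 hidden states, 3 tokens); not from the paper.
INIT = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.2, 0.8]]
EMIT = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]

rng = random.Random(0)
states, obs = sample_hmm(INIT, TRANS, EMIT, 8, rng)
probs = next_token_distribution(INIT, TRANS, EMIT, obs)
```

Here `probs` is a probability vector over the three tokens. All toy emission probabilities are strictly positive, so the normalization step never divides by zero; a version for sparse emission matrices would need to guard that case.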
Peer Reviews
Decision · Submitted to ICLR 2026
The paper is an interesting extension of previous works on the ability of transformers to learn Markov data to the more general HMMs. Filling this gap is of theoretical interest, especially due to the proof tools employed, leveraging previous literature on HMMs and ML architectures, and some of the theoretical assumptions used to simplify the theoretical analysis, such as the low-rank structure of the Markov transition kernel.
My main concern about the paper is the lack of clarity and detail in the experiments of Section 2. I think this section needs extensive rewriting. The dataset construction described in Section 2.1 is very badly explained and not clear at all. The description provided in Appendix B.1 is a bit better. Figures 1-2-3 are not described in sufficient detail and are very confusing and hard to understand. In particular, it is not clear how Figure 3 is generated. The Related Work section is also way to
1. The paper attempts to address an extremely interesting and important problem. 2. The authors aim to understand this phenomenon from both theoretical and practical perspectives.
1. I spent a considerable amount of time on this paper, but still found it difficult to parse and follow. A thorough rewrite could significantly improve its clarity and readability. 2. Several figures are also difficult to interpret and would benefit from clearer labeling or improved presentation. 3. Additionally, some relevant references appear to be missing and should be included to provide appropriate context and attribution.
Using a structured synthetic setup like Hidden Markov Models to study how transformers learn the task structure is a reasonable and well-motivated approach. The approach of pairing empirical observations about layer-wise behavior with a theoretical characterization of how transformers can approximate HMM distributions has the potential to provide complementary insights.
The overall presentation lacks clarity and organization, which makes it difficult to follow the results, and the paper would benefit from a thorough revision. The organization is confusing. For instance, many details about the experiments in Section 2 are either missing or only defined later (such as the data format introduced in Section 3), and even there, not clearly. The probing experiment starting around line 126 is not clearly described and the discussions are not concrete enough. The metri
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
