On Limitation of Transformer for Learning HMMs
Jiachen Hu, Qinghua Liu, Chi Jin

TL;DR
This paper evaluates the limitations of Transformer architectures in learning Hidden Markov Models, revealing their underperformance compared to RNNs and proposing a Chain-of-Thought variant to improve their capabilities.
Contribution
It provides empirical evidence of Transformers' struggles with HMMs and introduces a novel block CoT method to enhance their sequence learning, supported by theoretical expressiveness results.
Findings
Transformers underperform RNNs in learning HMMs.
A block CoT variant improves Transformers' learning of longer sequences.
Theoretical results show that Transformers can approximate HMMs with depth logarithmic in the sequence length.
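One intuition for why logarithmic depth can suffice is sketched below. This is an illustration under stated assumptions, not the paper's construction: if each step of an HMM is viewed as multiplying by a matrix, then because matrix multiplication is associative, a product of T such matrices can be combined pairwise in about log2(T) rounds instead of T sequential steps. The function names `sequential_product` and `tree_product` are hypothetical and chosen only for this example.

```python
import numpy as np

def sequential_product(mats):
    """Left-to-right product, one step per matrix (recurrent-style update)."""
    out = np.eye(mats[0].shape[0])
    for m in mats:
        out = m @ out  # apply the newest step on the left
    return out

def tree_product(mats):
    """Pairwise (balanced-tree) product.

    Returns the same product as sequential_product, together with the
    number of pairwise-combination rounds that were needed.
    """
    level = list(mats)
    rounds = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            # Later matrix goes on the left so the original order is preserved.
            nxt.append(level[i + 1] @ level[i])
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # unpaired last element carries over
        level = nxt
        rounds += 1
    return level[0], rounds
```

For 16 matrices the loop finishes in 4 rounds, which matches the "length 2^L with L layers" scaling mentioned in the reviews; the sequential version needs 16 steps to reach the same product.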
Abstract
Despite the remarkable success of Transformer-based architectures in various sequential modeling tasks, such as natural language processing, computer vision, and robotics, their ability to learn basic sequential models, like Hidden Markov Models (HMMs), is still unclear. This paper investigates the performance of Transformers in learning HMMs and their variants through extensive experimentation and compares them to Recurrent Neural Networks (RNNs). We show that Transformers consistently underperform RNNs in both training speed and testing accuracy across all tested HMM models. There are even challenging HMM instances where Transformers struggle to learn, while RNNs can successfully do so. Our experiments further reveal the relation between the depth of Transformers and the longest sequence length it can effectively learn, based on the types and the complexity of HMMs. To address the…
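To make the learning task concrete, here is a minimal sketch of sampling observation sequences from an HMM and computing the Bayes-optimal next-observation distribution with forward filtering. It is not taken from the paper: the names `sample_hmm`, `next_obs_probs`, `T`, `O`, and `pi` are assumptions for illustration, and the paper's exact training targets and HMM instances may differ.

```python
import numpy as np

def sample_hmm(T, O, pi, length, rng):
    """Sample an observation sequence from an HMM.

    T[i, j] = P(next state j | state i)
    O[i, k] = P(observation k | state i)
    pi[i]   = P(initial state i)
    """
    n_states, n_obs = O.shape
    s = rng.choice(n_states, p=pi)
    obs = []
    for _ in range(length):
        obs.append(rng.choice(n_obs, p=O[s]))  # emit from the current state
        s = rng.choice(n_states, p=T[s])       # then transition
    return np.array(obs)

def next_obs_probs(T, O, pi, obs):
    """Forward filtering: P(x_{t+1} | x_1..x_t) for every prefix of obs."""
    belief = pi.copy()  # prior over the current hidden state
    preds = []
    for x in obs:
        belief = belief * O[:, x]       # condition on the observation
        belief = belief / belief.sum()  # normalize to a posterior
        belief = belief @ T             # advance one time step
        preds.append(belief @ O)        # predictive distribution over next obs
    return np.array(preds)
```

A sequence model trained on such samples can be compared against `next_obs_probs`, which serves as a reference predictor when the true parameters are known.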
Peer Reviews
Decision · Submitted to ICLR 2025
I thought the idea of using sequences from HMMs to study the learning abilities of transformers was quite a good one: it is possible to control the complexity of the sequences through the Markov order of the HMMs, and to control the number of states, the output distribution, etc. The authors created Cyclic-{DET, RND, HARD} models, which have different properties in terms of complexity, mixing length, etc. The block CoT method seems to be a good way to propagate more information through the model
I don't want to be too certain about what the weaknesses of the paper are until I read the authors' rebuttals and clarifications, since it's possible I may have missed some details and motivations. My first reaction is, sadly, an instinctive one. It makes sense to me that deeper models should perform better, and that the more complex the sequence type, the harder it would be for an algorithm to model it. However, I am quite surprised to see that recurrent models learn so much faste
The work is timely given the widespread adoption of transformer-based models. HMMs are an important area to study. The findings are backed by theoretical analyses. The paper is well written and easy to follow.
The experiments were mostly conducted with model HMMs. Including experiments on real-world datasets would show whether the findings translate into practical benefits and implications.
* Understanding our tools is important in ML; different methods have different trade-offs, and this is another demonstration of the types of problems that transformers struggle with and for which other models (or hybrid models) should be applied.
* Their proof in 5.2 of the number of layers L needed in the transformer to model an HMM of length 2^L is a useful bound that will find applications outside of this domain. Many problems being solved in LLMs are highly sequential, and there can be a trade-off betwe
* HMMs are intrinsically a sequential problem; in a way, it's unsurprising that the Transformer model would perform worse at them. E.g., expressing the "copy the symbols from N characters ago" task is pretty hard in an HMM or an RNN, but is very natural in a Transformer. Limiting the scope of problems you look at is fine, but it would be useful to have at least a short discussion of the types of problems that would or would not be well modeled by an HMM.
* The paper would have been stronger if a par
Taxonomy
Topics: Neural Networks and Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
