Transformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures
Zhiheng Chen, Ruofan Wu, Guanhua Fang

TL;DR
This paper explores the use of transformer architectures for solving Gaussian Mixture Models in unsupervised learning, demonstrating both empirical effectiveness and theoretical approximation capabilities.
Contribution
It introduces TGMM, a transformer-based framework for GMMs, and proves transformers can approximate classical unsupervised algorithms like EM and spectral methods.
Findings
Transformers effectively solve GMM tasks and outperform classical methods.
TGMM demonstrates robustness to distribution shifts.
Transformers can approximate EM and spectral algorithms theoretically.
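For readers unfamiliar with the classical baseline named in the findings, the following is a minimal sketch of textbook EM for a Gaussian mixture. It is not TGMM and not the authors' code. It assumes unit-variance isotropic components (so only mixing weights and means are estimated), and the names `em_isotropic_gmm` and `sample_gmm` are hypothetical helpers introduced for illustration only.

```python
import math
import random


def em_isotropic_gmm(points, init_means, n_iters=50):
    """Textbook EM for a mixture of unit-variance isotropic Gaussians.

    Assumption of this sketch: the covariance of every component is fixed
    to the identity, so only mixing weights and means are updated.
    """
    k = len(init_means)
    n = len(points)
    dim = len(points[0])
    means = [list(m) for m in init_means]
    weights = [1.0 / k] * k
    for _ in range(n_iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in points:
            log_p = [
                math.log(weights[j])
                - 0.5 * sum((xi - mi) ** 2 for xi, mi in zip(x, means[j]))
                for j in range(k)
            ]
            top = max(log_p)  # subtract the max for numerical stability
            unnorm = [math.exp(lp - top) for lp in log_p]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate weights and means from the responsibilities.
        for j in range(k):
            # Floor the mass so an empty component cannot divide by zero.
            mass = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = mass / n
            means[j] = [
                sum(r[j] * x[d] for r, x in zip(resp, points)) / mass
                for d in range(dim)
            ]
    return weights, means


def sample_gmm(n, true_means, seed=0):
    """Draw n points from an equal-weight, unit-variance mixture."""
    rng = random.Random(seed)
    return [
        [rng.gauss(m, 1.0) for m in rng.choice(true_means)]
        for _ in range(n)
    ]
```

A typical call would be `em_isotropic_gmm(sample_gmm(400, [[-4.0, 0.0], [4.0, 0.0]]), [[-1.0, 0.0], [1.0, 0.0]])`. As with any EM run, the result depends on the initial means, which is one of the known limitations of the classical method.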
Abstract
The transformer architecture has demonstrated remarkable capabilities in modern artificial intelligence, among which the capability of implicitly learning an internal model during inference time is widely believed to play a key role in the understanding of pre-trained large language models. However, most recent works have focused on supervised learning topics such as in-context learning, leaving the field of unsupervised learning largely unexplored. This paper investigates the capabilities of transformers in solving Gaussian Mixture Models (GMMs), a fundamental unsupervised learning problem, through the lens of statistical estimation. We propose a transformer-based learning framework called TGMM that simultaneously learns to solve multiple GMM tasks using a shared transformer backbone. The learned models are empirically demonstrated to effectively mitigate the limitations…
Peer Reviews
Decision·ICLR 2026 Poster
1. This paper proposes a new algorithm that uses transformer architectures for GMM learning.
2. The authors empirically validate the effectiveness, robustness and flexibility of TGMM, and provide theoretical justifications by proving approximability results of their method.
1. While the authors try to demonstrate the effectiveness of TGMM via experiments, it actually performs worse than spectral methods in Figure 2 and Figure 4. These figures also fail to decouple statistical errors and algorithmic errors. Spectral methods, as theoretically guaranteed, will have errors converging towards 0 with enough samples. The authors should at least show the same (empirically or theoretically) for TGMM. To demonstrate effectiveness, in my opinion, they should also compare sam
• The approach is sound and improves upon the related work especially of He et al 2025b.
• The results are compelling and the procedure appears promising when compared to EM and spectral estimation of GMMs.
• The design to accommodate different number of components seems logical and useful.
Originality: The approach expands upon current efforts utilizing Transformers in the context of unsupervised learning in GMM especially leveraging the recent works of He et al 2025a,b with substantial impr
The considered GMM formalism is quite limited and it would strengthen the contribution to consider more realistic settings such as GMMs with diagonal covariance clusters. It is unclear why this would form a major challenge in the presented framework as it technically just requires expanding the readout function to have parameters for the variances of similar dimensionality as the produced means in the readout function and loss functions based on an additional squared error loss term for the diag
- The paper provides an interesting connection between the modern approach and the classical approach in machine learning. I believe this connection can open up a new pathway to explaining the fundamental structures of different models.
- The presentation is clear. Readers of different levels of expertise should be able to grasp the central idea of this paper.
- The paper is generally good.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Bayesian Methods and Mixture Models · Gaussian Processes and Bayesian Inference
