Transformers versus the EM Algorithm in Multi-class Clustering
Yihan He, Hong-Yu Chen, Yuan Cao, Jianqing Fan, and Han Liu

TL;DR
This paper studies, both theoretically and empirically, how well Transformer models perform unsupervised multi-class clustering of data drawn from Gaussian Mixture Models.
Contribution
It establishes a theoretical connection between Softmax Attention layers and the EM algorithm, provides approximation bounds for the Expectation and Maximization steps, and shows that Transformers can achieve the minimax optimal rate for the clustering problem.
Findings
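Transformers can approximate the Expectation and Maximization steps of EM, with bounds derived from a universal approximation result for multivariate mappings by Softmax functions.
With a sufficient number of pre-training samples and a proper initialization, Transformers achieve the minimax optimal rate for the clustering problem.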
Empirical results confirm Transformers' strong inference abilities beyond theoretical assumptions.
Abstract
LLMs demonstrate significant inference capacities in complicated machine learning tasks, using the Transformer model as their backbone. Motivated by the limited understanding of such models on unsupervised learning problems, we study the learning guarantees of Transformers in performing multi-class clustering of Gaussian Mixture Models. We develop a theory drawing strong connections between the Softmax Attention layers and the workflow of the EM algorithm on clustering the mixture of Gaussians. Our theory provides approximation bounds for the Expectation and Maximization steps by proving the universal approximation abilities of multivariate mappings by Softmax functions. In addition to the approximation guarantees, we also show that with a sufficient number of pre-training samples and an initialization, Transformers can achieve the minimax optimal rate for the problem considered.…
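To make the connection between Softmax Attention and EM concrete, here is a minimal sketch of one textbook EM iteration for a Gaussian mixture. It is not the paper's construction, which this page does not reproduce. It assumes identity covariance for every component, and the names `em_step`, `true_mu`, and the synthetic data are hypothetical. Under that assumption the E-step responsibilities are a softmax over the inner products between samples and component means, and the M-step is a responsibility-weighted average of the samples, which has the same form as attention weights applied to values.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def em_step(X, mu, pi):
    """One EM iteration for a K-component GMM with identity covariance (assumed).

    X: (n, d) samples, mu: (K, d) current means, pi: (K,) mixing weights.
    """
    # E-step: log pi_k - ||x_i - mu_k||^2 / 2 equals, up to a term that is
    # constant in k, x_i . mu_k - ||mu_k||^2 / 2 + log pi_k. The softmax over k
    # therefore looks like attention with samples as queries and means as keys.
    logits = X @ mu.T - 0.5 * (mu ** 2).sum(axis=1) + np.log(pi)
    resp = softmax(logits, axis=1)              # (n, K) responsibilities

    # M-step: responsibility-weighted averages of the samples.
    weights = resp.sum(axis=0)                  # (K,)
    new_mu = (resp.T @ X) / weights[:, None]    # (K, d)
    new_pi = weights / X.shape[0]               # (K,)
    return new_mu, new_pi, resp

# Synthetic, well-separated three-class data (illustrative values only).
rng = np.random.default_rng(0)
true_mu = np.array([[-4.0, 0.0], [4.0, 0.0], [0.0, 6.0]])
X = np.concatenate([m + rng.standard_normal((200, 2)) for m in true_mu])

# Start near the true means, standing in for the "proper initialization".
mu = true_mu + 0.5 * rng.standard_normal(true_mu.shape)
pi = np.full(3, 1.0 / 3.0)
for _ in range(20):
    mu, pi, resp = em_step(X, mu, pi)

labels = resp.argmax(axis=1)  # hard cluster assignment per sample
```

The E-step here is a single softmax over linear scores, which is the structural similarity the abstract points to. How the paper maps the M-step and repeated iterations onto Transformer layers is not specified on this page.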
Taxonomy
Topics: Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models · Face and Expression Recognition
