Mixture-of-Transformers Learn Faster: A Theoretical Study on Classification Problems
Hongbo Li, Qinhang Wu, Sen Lin, Yingbin Liang, and Ness B. Shroff

TL;DR
This paper introduces the Mixture-of-Transformers (MoT), a theoretical framework that explains how expert specialization and attention alignment improve training efficiency and learning speed in transformer models, with proven convergence guarantees.
Contribution
It provides the first unified theoretical analysis of transformer-level specialization, learning dynamics, and convergence rates, supported by extensive experiments.
Findings
Expert specialization reduces gradient conflicts.
Training drives the prediction loss below $\epsilon$ in $\mathcal{O}(\log(\epsilon^{-1}))$ steps.
MoT significantly outperforms single transformers in convergence speed.
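To make the gap between a logarithmic and an inverse-polynomial step count concrete, the following sketch counts the iterations needed to reach a target error under two idealized rates: a geometric bound `c * rho**t` (linear convergence) and a `c / t` bound (sublinear convergence). The constants `rho` and `c` and both function names are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
import math

def steps_linear(eps, rho=0.5, c=1.0):
    """Smallest t with c * rho**t <= eps (geometric, i.e. linear, convergence)."""
    if eps >= c:
        return 0
    # Solve c * rho**t <= eps for t; log(rho) is negative, so the ratio is positive.
    return math.ceil(math.log(eps / c) / math.log(rho))

def steps_sublinear(eps, c=1.0):
    """Smallest t >= 1 with c / t <= eps (sublinear O(1/t) convergence)."""
    return max(1, math.ceil(c / eps))

# Illustrative comparison; rho and c are assumed values, not results from the paper.
for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    print(f"eps={eps:g}: linear={steps_linear(eps)}, sublinear={steps_sublinear(eps)}")
```

Under these assumed constants, the geometric rate adds only a fixed number of steps for every tenfold reduction in `eps`, while the `c / t` rate multiplies the step count by ten.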
Abstract
Mixture-of-Experts (MoE) models improve transformer efficiency but lack a unified theoretical explanation, especially when both feed-forward and attention layers are allowed to specialize. To this end, we study the Mixture-of-Transformers (MoT), a tractable theoretical framework in which each transformer block acts as an expert governed by a continuously trained gating network. This design allows us to isolate and study the core learning dynamics of expert specialization and attention alignment. In particular, we develop a three-stage training algorithm with continuous training of the gating network, and show that each transformer expert specializes in a distinct class of tasks and that the gating network accurately routes data samples to the correct expert. Our analysis shows how expert specialization reduces gradient conflicts and makes each subtask strongly convex. We prove that the…
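The routing idea described in the abstract, a gating network that sends each sample to one transformer expert, can be sketched in a few lines of NumPy. This is a minimal forward-pass illustration under assumptions of my own: top-1 routing on the mean token, a single-head attention expert with a merged query-key matrix, and a linear readout. The class names, shapes, and initialization are hypothetical, and the sketch does not reproduce the paper's three-stage training algorithm.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class TransformerExpert:
    """Hypothetical expert: single-head self-attention plus a linear readout."""
    def __init__(self, d, rng):
        self.W_qk = rng.normal(scale=0.1, size=(d, d))  # merged query-key matrix (assumption)
        self.w_out = rng.normal(scale=0.1, size=d)      # linear classifier weights

    def forward(self, X):
        # X has shape (n_tokens, d).
        attn = softmax(X @ self.W_qk @ X.T, axis=-1)    # (n_tokens, n_tokens)
        attended = attn @ X                             # (n_tokens, d)
        return float(attended.mean(axis=0) @ self.w_out)  # scalar logit

class MixtureOfTransformers:
    """Hypothetical MoT: a linear gate picks one expert per sample (top-1)."""
    def __init__(self, d, n_experts, seed=0):
        rng = np.random.default_rng(seed)
        self.gate = rng.normal(scale=0.1, size=(n_experts, d))
        self.experts = [TransformerExpert(d, rng) for _ in range(n_experts)]

    def route(self, X):
        # Gate on the mean token embedding (an illustrative choice).
        probs = softmax(self.gate @ X.mean(axis=0))
        return int(np.argmax(probs)), probs

    def forward(self, X):
        k, _ = self.route(X)
        return self.experts[k].forward(X), k

mot = MixtureOfTransformers(d=4, n_experts=3)
sample = np.random.default_rng(1).normal(size=(5, 4))  # 5 tokens, 4 features
logit, expert_id = mot.forward(sample)
print(f"routed to expert {expert_id}, logit {logit:.4f}")
```

Because only the selected expert processes a sample, each expert's parameters receive gradients only from the samples routed to it, which is the mechanism the abstract credits with reducing gradient conflicts.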
Peer Reviews
Decision · Submitted to ICLR 2026
1. To the best of my knowledge, the theoretical foundation of MoT is new. 2. The paper is clear and easy to follow.
1. While I appreciate the theoretical challenges involved, the proposed model appears somewhat simplified. More importantly, the authors do not sufficiently address the gap between theory and practice. For instance, do the optimal choices of \( T_1 \) and \( T_2 \) suggested by Propositions 1 and 2 actually translate into practical improvements? Furthermore, Figure 2 does not demonstrate the expected linear convergence rate—performance remains comparable to baselines that only enjoy sublinear th
(1) Provides the first rigorous analysis of full-transformer specialization, bridging a major gap in the MoE literature. The three-stage training procedure is interesting and isolates the roles of FFN and attention specialization. (2) Proves faster convergence rates for MoT compared to standard baselines, supported by detailed proofs. (3) Experiments on CIFAR-10/100 datasets robustly support the theoretical claims, demonstrating practical impact.
(1) MoE models are typically trained in a single stage that learns all model parameters (attention, MLP, gating) jointly. This paper proposes a three-stage training procedure, which complicates training. An ablation analysis would help, e.g., how the analysis changes with a two-stage or single-stage procedure. (2) While the experiments look solid, they focus on relatively small-scale models (e.g., a lightweight Vision Transformer) and benchmarks.
- **Clean theoretical result (in isolation):** The $\mathcal{O}(\log(\epsilon^{-1}))$ vs. $\mathcal{O}(\epsilon^{-1})$ contrast is stark and theoretically elegant. Proving that the attention-absent MoE has a fundamental error floor (Lemma 1) is a good theoretical contribution, highlighting why attention specialization is necessary in this problem setup.
- **Clarity:** The paper is well-organized and clearly written. It systematically presents the model, the training algorithm, and the theoretic
- **(Maybe) overly simplified theoretical model:** As detailed under "Soundness," the theoretical analysis rests on a foundation of strong assumptions (orthogonal data, single-layer model, merged matrices) that may not hold in practice. This severely limits the direct applicability of the findings to the deep, complex transformers used in the real world. The theory does not account for the interactions between multiple layers of experts.
- **Lack of realistic empirical validation:** The experim
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications
