Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers

Anrui Chen; Ruijun Huang; Xin Zhang; Fang Dong; Hengjie Cao; Zhendong Huang; Yifeng Yang; Mengyi Chen; Jixian Zhou; Mingzhi Dong; Yujiang Wang; Jinlong Hou; Qin Lv; Robert P. Dick; Yuan Cheng; Tun Lu; Fan Yang; Li Shang

arXiv:2602.12587 · cs.LG · February 16, 2026

TL;DR

This paper identifies a pre-routing bottleneck in MoE Transformers: multi-head attention concatenates head-specific signals into a single router input, which contributes to catastrophic forgetting. It proposes MH-MoE, which uses head-wise routing, to mitigate the problem.

Contribution

The paper reveals how multi-head attention causes routing collisions in MoE Transformers and introduces MH-MoE, a head-wise routing method that reduces forgetting in continual learning.

Findings

01. MH-MoE reduces BWT (backward transfer, a forgetting measure) from 11.2% to 4.5% on Qwen3-0.6B.

02. Higher effective composition number $N_{eff}$ correlates with increased forgetting.

03. Routing collisions are a key factor in catastrophic forgetting in MoE Transformers.

Abstract

Mixture-of-Experts (MoE) architectures are often considered a natural fit for continual learning because sparse routing should localize updates and reduce interference, yet MoE Transformers still forget substantially even with sparse, well-balanced expert utilization. We attribute this gap to a pre-routing bottleneck: multi-head attention concatenates head-specific signals into a single post-attention router input, forcing routing to act on co-occurring feature compositions rather than separable head channels. We show that this router input simultaneously encodes multiple separately decodable semantic and structural factors with uneven head support, and that different feature compositions induce weakly aligned parameter-gradient directions; as a result, routing maps many distinct compositions to the same route. We quantify this collision effect via a route-wise effective composition…
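To make the pre-routing bottleneck concrete, the following minimal sketch contrasts standard token-level routing, where the router sees the concatenation of all head outputs and picks one expert per token, with a head-wise variant that routes each head's slice separately. It is an illustration under stated assumptions, not the paper's MH-MoE implementation: the function names, the top-1 selection rule, the shared per-head router matrix, and all sizes are hypothetical choices made for this example.

```python
import random


def top1_route(x, router):
    """Return the index of the expert whose router row has the highest dot product with x."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in router]
    return max(range(len(scores)), key=scores.__getitem__)


def token_level_route(heads, router):
    """Standard MoE routing: concatenate all head outputs, choose one expert per token."""
    concat = [v for head in heads for v in head]
    return top1_route(concat, router)


def head_wise_route(heads, head_router):
    """Head-wise sketch (assumption): route each head's slice independently.

    A single router matrix is shared across heads here purely for brevity.
    """
    return [top1_route(head, head_router) for head in heads]


# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
num_heads, head_dim, num_experts = 4, 8, 4
heads = [[rng.gauss(0, 1) for _ in range(head_dim)] for _ in range(num_heads)]
router = [[rng.gauss(0, 1) for _ in range(num_heads * head_dim)]
          for _ in range(num_experts)]
head_router = [[rng.gauss(0, 1) for _ in range(head_dim)]
               for _ in range(num_experts)]

token_route = token_level_route(heads, router)      # one expert index for the whole token
head_routes = head_wise_route(heads, head_router)   # one expert index per head
print("token-level route:", token_route)
print("head-wise routes:", head_routes)
```

In the token-level case, every combination of head signals must share a single routing decision, which is the setting in which the abstract says distinct feature compositions collide on the same route. In the head-wise case, each head channel receives its own decision, so separable head signals can be sent to different experts.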


Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis