UMoE: Unifying Attention and FFN with Shared Experts

Yuanhang Yang; Chaozheng Wang; Jing Li

arXiv:2505.07260 · cs.LG · October 24, 2025

Open Access

TL;DR

UMoE introduces a unified approach to sparse Mixture of Experts in Transformers by reformulating attention mechanisms to share parameters with FFN layers, leading to improved performance and efficiency.

Contribution

The paper presents UMoE, a novel architecture that unifies attention and FFN MoE layers through a reformulated attention mechanism enabling shared experts.

Findings

01. UMoE outperforms traditional MoE models in accuracy.

02. Shared experts reduce model complexity and improve efficiency.

03. Unified design simplifies implementation of MoE in Transformers.

Abstract

Sparse Mixture of Experts (MoE) architectures have emerged as a promising approach for scaling Transformer models. While initial works primarily incorporated MoE into feed-forward network (FFN) layers, recent studies have explored extending the MoE paradigm to attention layers to enhance model performance. However, existing attention-based MoE layers require specialized implementations and demonstrate suboptimal performance compared to their FFN-based counterparts. In this paper, we aim to unify MoE designs in attention and FFN layers by introducing a novel reformulation of the attention mechanism that reveals an underlying FFN-like structure within attention modules. Our proposed architecture, UMoE, achieves superior performance through attention-based MoE layers while enabling efficient parameter sharing between FFN and attention components.
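
One way to see how an FFN-like structure can sit inside an attention module is to regroup the matrix products of a single attention head. The NumPy sketch below is illustrative only and is not taken from the paper: it checks that mixing tokens first and then applying the value and output projections gives the same result as the conventional order, so the two projections act as a two-matrix map applied to each mixed token. The sizes and the names `W_q`, `W_k`, `W_v`, `W_o` are assumptions chosen for the example. The paper's actual formulation, including routing, any nonlinearity, and multi-head handling, is not given on this page and may differ.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, d_h = 5, 8, 4  # tokens, model width, head width (illustrative sizes)

X = rng.normal(size=(n, d))      # token representations
W_q = rng.normal(size=(d, d_h))  # query projection (hypothetical weights)
W_k = rng.normal(size=(d, d_h))  # key projection
W_v = rng.normal(size=(d, d_h))  # value projection
W_o = rng.normal(size=(d_h, d))  # output projection

# (n, n) attention weights for a single head, no masking.
A = softmax((X @ W_q) @ (X @ W_k).T / np.sqrt(d_h))

# Conventional order: project to values, mix tokens, project out.
standard = (A @ (X @ W_v)) @ W_o

# Regrouped order: mix tokens first, then apply a two-matrix, FFN-shaped map.
mixed = A @ X
regrouped = (mixed @ W_v) @ W_o

# The two orders agree up to floating-point error (matrix associativity).
max_abs_diff = float(np.abs(standard - regrouped).max())
print(max_abs_diff)
```

In the regrouped form, the pair `W_v`, `W_o` has the same shape pattern as the two linear layers of an FFN, which is the kind of structure that would make sharing parameters between attention and FFN experts plausible.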

Taxonomy

Topics: Advanced Graph Neural Networks · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Mixture of Experts · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax