ExFusion: Efficient Transformer Training via Multi-Experts Fusion

Jiacheng Ruan; Daize Dong; Xiaoye Qu; Tong Zhu; Ting Liu; Yuzhuo Fu; Yu Cheng; Suncheng Xiang

arXiv:2603.27965·cs.CV·March 31, 2026

TL;DR

ExFusion is a novel pre-training method that efficiently leverages multi-expert fusion in Transformer models, enhancing performance with minimal additional computational cost and no extra deployment overhead.

Contribution

The paper introduces a multi-expert fusion approach for Transformer pre-training that keeps training efficient and avoids the extra deployment overhead of a full MoE model.

Findings

1. Effective multi-expert fusion improves Transformer training efficiency.

2. Achieves comparable performance with reduced computational overhead.

3. Simplifies deployment by consolidating experts post-training.

Abstract

Mixture-of-Experts (MoE) models substantially improve performance by increasing the capacity of dense architectures. However, directly training MoE models requires considerable computational resources and introduces extra overhead in parameter storage and deployment. Therefore, it is critical to develop an approach that leverages the multi-expert capability of MoE to enhance performance while incurring minimal additional cost. To this end, we propose a novel pre-training approach, termed ExFusion, which improves the efficiency of Transformer training through multi-expert fusion. Specifically, during the initialization phase, ExFusion upcycles the feed-forward network (FFN) of the Transformer into a multi-expert configuration, where each expert is assigned a weight for later parameter fusion. During training, these weights allow multiple experts to be fused into a single unified expert…
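
The upcycle-then-fuse idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated and does not say how the fusion weights are normalized or learned, so the softmax normalization, the optional Gaussian perturbation at initialization, the bias-free ReLU FFN, and all function names (`upcycle`, `fuse`, `ffn_forward`) are assumptions made only for illustration.

```python
import math
import random


def upcycle(ffn, num_experts, noise=0.0, seed=0):
    """Copy one dense FFN into `num_experts` experts.

    `noise` (hypothetical) adds a small Gaussian perturbation so the copies
    are not identical; the paper's initialization may differ.
    """
    rng = random.Random(seed)
    return [
        {name: [[w + rng.gauss(0.0, noise) for w in row] for row in mat]
         for name, mat in ffn.items()}
        for _ in range(num_experts)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(experts, logits):
    """Fuse experts into one FFN by a weighted average of their parameters.

    Assumption: fusion weights are softmax-normalized per-expert scalars.
    """
    weights = softmax(logits)
    fused = {}
    for name, mat in experts[0].items():
        fused[name] = [
            [sum(w * e[name][i][j] for w, e in zip(weights, experts))
             for j in range(len(mat[0]))]
            for i in range(len(mat))
        ]
    return fused


def ffn_forward(params, x):
    """Bias-free FFN: w2 @ relu(w1 @ x), with x a list of floats."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)))
              for row in params["w1"]]
    return [sum(w * h for w, h in zip(row, hidden)) for row in params["w2"]]


# Toy dense FFN: d_model = 2, d_ff = 3.
dense_ffn = {
    "w1": [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]],
    "w2": [[0.2, 0.1, -0.3], [0.7, -0.5, 0.4]],
}
experts = upcycle(dense_ffn, num_experts=4, noise=0.01)
fused_ffn = fuse(experts, logits=[0.0, 0.0, 0.0, 0.0])
output = ffn_forward(fused_ffn, [1.0, 2.0])
```

Because the fused result has the same shape as the original FFN, a forward pass costs the same as in the dense model and only one set of FFN parameters needs to be kept for deployment, which is the property the TL;DR and abstract emphasize.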
