Transformer Fusion with Optimal Transport
Moritz Imfeld, Jacopo Graldi, Marco Giordano, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh (ETH Zurich)

TL;DR
This paper introduces a novel method for fusing multiple transformer-based neural networks using Optimal Transport, enabling model compression and improved performance across vision and language tasks.
Contribution
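It presents a systematic approach for aligning and fusing transformer components (multi-head self-attention, layer normalization, and residual connections) using Optimal Transport, including across heterogeneous models. Prior fusion methods were restricted to fully-connected, convolutional, and residual networks.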
Findings
Outperforms vanilla fusion methods in experiments
Enables effective fusion of models of different sizes
Achieves superior results after short fine-tuning
Abstract
Fusion is a technique for merging multiple independently-trained neural networks in order to combine their capabilities. Past attempts have been restricted to the case of fully-connected, convolutional, and residual networks. This paper presents a systematic approach for fusing two or more transformer-based networks, exploiting Optimal Transport to (soft-)align the various architectural components. We flesh out an abstraction for layer alignment that can, in principle, generalize to arbitrary architectures. We apply it to the key ingredients of Transformers, such as multi-head self-attention, layer normalization, and residual connections, and we discuss how to handle them via various ablation studies. Furthermore, our method allows the fusion of models of different sizes (heterogeneous fusion), providing a new and efficient way to compress Transformers. The proposed approach is…
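To make the idea of aligning layers before merging concrete, here is a minimal sketch. It is not the paper's implementation: it assumes two bias-free, two-layer MLPs of equal width rather than Transformers, uses weight-based hard alignment only, and the function name `align_and_average` is hypothetical. With equal widths and uniform marginals, an optimal transport plan can always be chosen to be a permutation, so the sketch solves a linear assignment problem with SciPy and then averages the aligned weights.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_and_average(w1_a, w2_a, w1_b, w2_b):
    """Fuse two bias-free 2-layer MLPs after hard-aligning B's hidden units to A's.

    w1_*: (hidden, in) first-layer weights.
    w2_*: (out, hidden) second-layer weights.
    Illustrative only: equal widths, weight-based cost, no soft alignment.
    """
    # cost[i, j] = squared distance between the incoming weights of
    # hidden neuron i in model A and hidden neuron j in model B.
    cost = ((w1_a[:, None, :] - w1_b[None, :, :]) ** 2).sum(axis=-1)

    # Hard alignment: the minimum-cost one-to-one matching of neurons.
    # For a square cost matrix, row indices come back as 0..hidden-1,
    # so col[i] is the B neuron matched to A neuron i.
    _, col = linear_sum_assignment(cost)

    # Reorder B's hidden units, and permute the next layer's input
    # columns the same way so that B still computes the same function.
    w1_b_aligned = w1_b[col]
    w2_b_aligned = w2_b[:, col]

    # Once aligned, plain averaging is meaningful.
    return 0.5 * (w1_a + w1_b_aligned), 0.5 * (w2_a + w2_b_aligned)


# Usage: model B is model A with its hidden units shuffled.
rng = np.random.default_rng(0)
w1_a, w2_a = rng.normal(size=(8, 4)), rng.normal(size=(3, 8))
perm = rng.permutation(8)
w1_b, w2_b = w1_a[perm], w2_a[:, perm]

fused_w1, fused_w2 = align_and_average(w1_a, w2_a, w1_b, w2_b)
naive_w1 = 0.5 * (w1_a + w1_b)  # averaging without alignment
```

In this toy case the aligned average recovers model A, since B is only a relabelling of A's neurons, whereas `naive_w1` mixes unrelated neurons. Soft alignment and the heterogeneous (different-width) case mentioned in the abstract would replace the permutation with a general transport map, which this sketch does not cover.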
Peer Reviews
Decision: ICLR 2024 poster
- This paper is well written. - The proposed method shows good generalization across different architectures. - The proposed method shows strong performance on several benchmarks.
- Most experiments compare only against Vanilla Fusion. More comparisons with state-of-the-art methods should be included. - Most experiments are conducted on the CIFAR dataset, which is relatively small.
- This paper is well-structured. - To the best of my knowledge, this is the first work that aims to fuse transformer architectures by aligning their weights. - The proposed method is successfully backed by theoretical results.
- The methodology part is not well-written and lacks some details.
1. The authors examined various strategies (weight vs. activation, hard vs. soft, etc.) for applying optimal transport (OT) methods. 2. The authors conducted experiments employing both Vision Transformer (ViT) and BERT architectures across multiple datasets. 3. The OT method demonstrates particular efficacy in one-shot scenarios. 4. OT methods exhibit versatility, as they can be effectively applied to models of varying widths, presenting a viable alternative to distillation.
1. The OT method yields comparatively lower performance when contrasted with ensemble methods. 2. The suitability of the OT method for achieving solid results on larger datasets, such as ImageNet-1K, in one-shot scenarios remains uncertain.
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Cell Image Analysis Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Linear Warmup With Linear Decay · WordPiece · Attention Dropout
