Transformer Fusion with Optimal Transport

Moritz Imfeld, Jacopo Graldi, Marco Giordano, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh (ETH Zurich)

arXiv:2310.05719 · cs.LG · April 23, 2024

TL;DR

This paper introduces a novel method for fusing multiple transformer-based neural networks using Optimal Transport, enabling model compression and improved performance across vision and language tasks.

Contribution

It presents a systematic approach for aligning and fusing transformer components, including heterogeneous models, using optimal transport, which is a significant advancement over prior fusion methods.

Findings

01. Outperforms vanilla fusion methods in experiments
02. Enables effective fusion of models of different sizes
03. Achieves superior results after short fine-tuning

Abstract

Fusion is a technique for merging multiple independently-trained neural networks in order to combine their capabilities. Past attempts have been restricted to the case of fully-connected, convolutional, and residual networks. This paper presents a systematic approach for fusing two or more transformer-based networks exploiting Optimal Transport to (soft-)align the various architectural components. We flesh out an abstraction for layer alignment, that can generalize to arbitrary architectures - in principle - and we apply this to the key ingredients of Transformers such as multi-head self-attention, layer-normalization, and residual connections, and we discuss how to handle them via various ablation studies. Furthermore, our method allows the fusion of models of different sizes (heterogeneous fusion), providing a new and efficient way to compress Transformers. The proposed approach is…
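As a rough illustration of the soft-alignment idea described in the abstract, the sketch below fuses the single hidden layer of two small MLPs. It is not the authors' implementation: the function names (`sinkhorn_plan`, `soft_align_and_fuse`), the weight-based squared-distance cost, the uniform neuron masses, and the regularization value are all assumptions made for this example. Model A acts as the anchor; an entropic-OT (Sinkhorn) plan maps model B's hidden neurons onto A's, and the aligned weights are then averaged. Because the plan may be rectangular, the same code also accepts a wider model B, which hints at how heterogeneous fusion can work.

```python
import numpy as np


def sinkhorn_plan(cost, reg=0.01, n_iters=200):
    """Entropic-OT plan between uniform marginals, computed in the log domain."""
    n_src, n_tgt = cost.shape
    log_mu = np.full(n_src, -np.log(n_src))  # uniform mass on source (B) neurons
    log_nu = np.full(n_tgt, -np.log(n_tgt))  # uniform mass on target (A) neurons
    log_k = -cost / reg
    f, g = np.zeros(n_src), np.zeros(n_tgt)
    for _ in range(n_iters):
        # Alternate between matching the row and the column marginals.
        f = log_mu - np.logaddexp.reduce(log_k + g[None, :], axis=1)
        g = log_nu - np.logaddexp.reduce(log_k + f[:, None], axis=0)
    return np.exp(f[:, None] + log_k + g[None, :])


def soft_align_and_fuse(w1_a, w2_a, w1_b, w2_b, reg=0.01):
    """Softly align model B's hidden layer to anchor model A, then average.

    w1_*: (hidden, d_in) incoming weights; w2_*: (d_out, hidden) outgoing weights.
    Hidden widths of A and B may differ; the fused model has A's width.
    """
    # Assumed ground cost: squared distance between incoming weight rows.
    diff = w1_b[:, None, :] - w1_a[None, :, :]
    cost = (diff ** 2).sum(axis=-1)
    cost = cost / (cost.max() + 1e-12)  # normalize so `reg` has a stable scale
    plan = sinkhorn_plan(cost, reg=reg)  # shape (hidden_b, hidden_a)
    # Rescale so every column sums to 1: each anchor neuron receives a
    # convex combination of B's neurons.
    t_map = plan * w1_a.shape[0]
    w1_b_aligned = t_map.T @ w1_b  # rows re-expressed in A's neuron order
    w2_b_aligned = w2_b @ t_map    # columns re-expressed in A's neuron order
    fused_w1 = 0.5 * (w1_a + w1_b_aligned)
    fused_w2 = 0.5 * (w2_a + w2_b_aligned)
    return fused_w1, fused_w2, t_map


# Usage: fuse a width-8 model B into a width-4 anchor model A.
rng = np.random.default_rng(0)
w1_a, w2_a = rng.normal(size=(4, 6)), rng.normal(size=(3, 4))
w1_b, w2_b = rng.normal(size=(8, 6)), rng.normal(size=(3, 8))
fused_w1, fused_w2, t_map = soft_align_and_fuse(w1_a, w2_a, w1_b, w2_b)
print(fused_w1.shape, fused_w2.shape, t_map.shape)
```

With a small `reg` the plan approaches a hard (permutation-like) matching, while larger values spread each neuron's mass over several counterparts. A Transformer would additionally need the attention heads, layer normalization, and residual stream handled consistently, which this sketch does not attempt.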

Peer Reviews

Decision: ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- This paper is well written.
- The proposed method shows good generalization across different architectures.
- The proposed method shows strong performance on several benchmarks.

Weaknesses

- Most experiments compare only against vanilla fusion. More comparisons with state-of-the-art methods should be included.
- Most experiments are conducted on CIFAR, which is relatively small.

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- This paper is well-structured.
- To the best of my knowledge, this is the first work that aims to fuse transformer architectures by aligning their weights.
- The proposed method is successfully backed by theoretical results.

Weaknesses

- The methodology part is not well-written and lacks some details.

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. The authors examined various strategies (weight vs. activation, hard vs. soft, etc.) for applying optimal transport (OT) methods.
2. The authors conducted experiments employing both Vision Transformer (ViT) and BERT architectures across multiple datasets.
3. The OT method demonstrates particular efficacy in one-shot scenarios.
4. OT methods exhibit versatility, as they can be effectively applied to models of varying widths, presenting a viable alternative to distillation.
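To make the "weight vs. activation" and "hard vs. soft" distinction in the first point concrete, here is a minimal sketch of the activation-based, hard variant for a single ReLU hidden layer. It is an illustration under stated assumptions rather than the paper's procedure: the helper names (`hidden_activations`, `hard_align_by_activations`) and the squared-distance cost are hypothetical, and the hard matching is solved with SciPy's `linear_sum_assignment`. A weight-based variant would compare the weight rows directly instead of the activations, and a soft variant would replace the one-to-one matching with a transport plan.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def hidden_activations(w1, x):
    """ReLU activations of one hidden layer; rows of x are samples."""
    return np.maximum(x @ w1.T, 0.0)


def hard_align_by_activations(w1_a, w1_b, x):
    """Return `match` such that B's neuron match[j] is paired with A's neuron j."""
    act_a = hidden_activations(w1_a, x)  # (n_samples, hidden)
    act_b = hidden_activations(w1_b, x)
    # cost[i, j]: squared distance between the activation profiles of
    # B's neuron i and A's neuron j over the probe batch x.
    diff = act_b.T[:, None, :] - act_a.T[None, :, :]
    cost = (diff ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(cost)  # minimum-cost one-to-one matching
    match = np.empty_like(rows)
    match[cols] = rows
    return match


# Usage: model B is a neuron-shuffled copy of model A.
rng = np.random.default_rng(0)
w1_a = rng.normal(size=(5, 4))
x = rng.normal(size=(32, 4))  # probe inputs used only to compare neurons
w1_b = w1_a[np.array([3, 1, 4, 0, 2])]
match = hard_align_by_activations(w1_a, w1_b, x)
w1_b_aligned = w1_b[match]  # B's rows reordered to follow A's neuron order
```

After alignment, the outgoing weights of the next layer would need their columns reordered with the same `match` so that the network's function is unchanged.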

Weaknesses

1. The OT method yields comparatively lower performance when contrasted with ensemble methods.
2. The suitability of the OT method for achieving solid results on larger datasets, such as ImageNet-1K, in one-shot scenarios remains uncertain.

Code & Models

Repositories

graldij/transformer-fusion (PyTorch, official)

Videos

Transformer Fusion with Optimal Transport · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Cell Image Analysis Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Linear Warmup With Linear Decay · WordPiece · Attention Dropout