MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections

Da Xiao; Qingye Meng; Shengping Li; Xingyuan Yuan

arXiv:2502.12170 · cs.LG · May 29, 2025

TL;DR

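MUDDFormer replaces static residual connections with dense cross-layer connections whose weights are generated from the hidden state at each sequence position. In language modeling it matches the performance of Transformers trained with 1.8X-2.4X the compute, across model architectures and scales.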

Contribution

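The paper presents MUDD connections, which generate dense connection weights dynamically from the hidden state at each sequence position, separately for each decoupled input stream of a Transformer block (query, key, value, or residual). This enhances cross-layer communication in Transformers with minimal additional parameters.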

Findings

01. Outperforms standard Transformers across multiple architectures and scales.

02. Achieves similar performance to larger models with fewer parameters and less computation.

03. Matches or surpasses larger models in downstream tasks and few-shot settings.

Abstract

We propose MUltiway Dynamic Dense (MUDD) connections, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers. Unlike existing dense connection approaches with static and shared connection weights, MUDD generates connection weights dynamically depending on hidden states at each sequence position and for each decoupled input stream (the query, key, value or residual) of a Transformer block. MUDD connections can be seamlessly integrated into any Transformer architecture to create MUDDFormer. Extensive experiments show that MUDDFormer significantly outperforms Transformers across various model architectures and scales in language modeling, achieving the performance of Transformers trained with 1.8X-2.4X compute. Notably, MUDDPythia-2.8B matches Pythia-6.9B in pretraining ppl and downstream tasks and even…
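The mechanism described in the abstract can be made concrete with a small sketch. The pure-Python code below is an illustration under stated assumptions, not the authors' implementation: the function names (`make_params`, `dynamic_weights`, `mudd_aggregate`), the tanh projection used to generate weights, and the bias initialization are all hypothetical choices made for this example. It shows only the core idea: for each input stream and each sequence position, a set of weights over all earlier layer outputs is computed from the current hidden state, and the stream's input is the weighted sum of those outputs.

```python
import math
import random

# The four decoupled input streams named in the abstract.
STREAMS = ("query", "key", "value", "residual")


def make_params(dim, num_sources, rng):
    """Hypothetical parameters: one projection and one static bias per stream."""
    params = {}
    for stream in STREAMS:
        # proj maps a hidden state (dim) to one weight per source layer.
        proj = [[rng.gauss(0.0, 0.02) for _ in range(num_sources)]
                for _ in range(dim)]
        bias = [0.0] * num_sources
        bias[-1] = 1.0  # assumption: start close to "use the newest output"
        params[stream] = (proj, bias)
    return params


def dynamic_weights(x_t, proj, bias):
    """Weights over source layers for one position, from its hidden state."""
    weights = []
    for j in range(len(bias)):
        score = sum(x_t[d] * proj[d][j] for d in range(len(x_t)))
        weights.append(bias[j] + math.tanh(score))  # assumed weight generator
    return weights


def mudd_aggregate(layer_outputs, params):
    """Mix all earlier outputs into one input per stream.

    layer_outputs: list of [seq_len][dim] nested lists, oldest first.
    Returns a dict mapping stream name to a [seq_len][dim] nested list.
    """
    current = layer_outputs[-1]
    seq_len, dim = len(current), len(current[0])
    mixed = {}
    for stream in STREAMS:
        proj, bias = params[stream]
        rows = []
        for t in range(seq_len):
            # Weights depend on position t and on the stream.
            w = dynamic_weights(current[t], proj, bias)
            rows.append([
                sum(w[j] * layer_outputs[j][t][d]
                    for j in range(len(layer_outputs)))
                for d in range(dim)
            ])
        mixed[stream] = rows
    return mixed


if __name__ == "__main__":
    rng = random.Random(0)
    # Two source outputs (e.g. embedding and first block), 3 positions, dim 4.
    outputs = [[[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
               for _ in range(2)]
    mixed = mudd_aggregate(outputs, make_params(4, 2, rng))
    print(sorted(mixed), len(mixed["query"]), len(mixed["query"][0]))
```

In this sketch, setting the projection to zero and the bias to select only the newest output returns that output unchanged on every stream, so the next block would see the same input it would get without any cross-layer mixing. The dynamic part is what lets each position, and each of the four streams, draw on a different mix of earlier layers.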

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Advanced Memory and Neural Computing