Hyper-Connections
Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, Xun Zhou

TL;DR
Hyper-connections offer a new method to improve neural network training by dynamically adjusting feature connections, addressing residual connection drawbacks, and enhancing performance in language and vision tasks.
Contribution
This paper introduces hyper-connections as an alternative to residual connections, enabling dynamic adjustment of feature link strengths and layer reordering.
Findings
Significant performance improvements in large language model pre-training
Enhanced results in vision tasks with hyper-connections
Effective alternative to residual connections across AI domains
Abstract
We present hyper-connections, a simple yet effective method that can serve as an alternative to residual connections. This approach specifically addresses common drawbacks observed in residual connection variants, such as the seesaw effect between gradient vanishing and representation collapse. Theoretically, hyper-connections allow the network to adjust the strength of connections between features at different depths and dynamically rearrange layers. We conduct experiments focusing on the pre-training of large language models, including dense and sparse models, where hyper-connections show significant performance improvements over residual connections. Additional experiments conducted on vision tasks also demonstrate similar improvements. We anticipate that this method will be broadly applicable and beneficial across a wide range of AI problems.
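To make the idea of adjusting connection strengths concrete, here is a minimal pure-Python sketch of one layer step with static hyper-connection weights. It is an illustration under assumptions, not the authors' implementation: the function name and the weight names (`alpha_in`, `alpha_res`, `beta`) are hypothetical, and the exact parameterization in the paper may differ. With a single hidden copy and all weights set to 1, the step reduces to an ordinary residual connection.

```python
def hyper_connection_step(streams, layer, alpha_in, alpha_res, beta):
    """One layer step with static hyper-connection weights (illustrative only).

    streams:   list of n hidden vectors, each a list of d floats
    layer:     function mapping a d-vector to a d-vector
    alpha_in:  n weights that mix the streams into the layer input (assumed name)
    alpha_res: n x n weights; alpha_res[i][j] is the weight of old stream i
               in new stream j (assumed name)
    beta:      n weights scaling the layer output added to each new stream
               (assumed name)
    """
    n = len(streams)
    d = len(streams[0])
    # Layer input: weighted sum of the n hidden copies.
    layer_in = [sum(alpha_in[i] * streams[i][k] for i in range(n)) for k in range(d)]
    layer_out = layer(layer_in)
    new_streams = []
    for j in range(n):
        # Carry-over: weighted mix of the old streams.
        mixed = [sum(alpha_res[i][j] * streams[i][k] for i in range(n)) for k in range(d)]
        # Add the scaled layer output to this stream.
        new_streams.append([mixed[k] + beta[j] * layer_out[k] for k in range(d)])
    return new_streams


# Toy layer standing in for an attention or feed-forward block.
double = lambda x: [2.0 * v for v in x]

# n = 1 with unit weights: identical to the residual update x + f(x).
print(hyper_connection_step([[1.0, 2.0]], double, [1.0], [[1.0]], [1.0]))
# [[3.0, 6.0]]
```

In a learnable version the three weight sets would be trainable parameters, and a dynamic variant would additionally compute them from the current hidden state; both are left out here to keep the sketch short.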
Peer Reviews
Decision: ICLR 2025 Poster
- The paper provides a clear and systematic extension of residual connections, named Dynamic Hyper-Connections (DHC), in which a residual connection can be considered a static hyper-connection, addressing the trade-off between gradient vanishing and representation collapse.
- Experimental results demonstrate the effectiveness of DHC across diverse domains, including LLM pre-training and vision tasks.
- There seems to be no comparison to fully enabled depth-connections and width-connections (DenseNet style), where all of the connections in Figure 2 are enabled and learnable.
- The main results focus on LLMs and downstream performance on language tasks; the results on vision tasks in the appendix seem to show smaller gains than those on language tasks. Can the authors elaborate on this?
- There is a clear signal that incorporating hyper-connections into LLM architectures, without any other modification, improves the training loss for a given number of tokens and boosts performance on downstream metrics. This result is validated for both dense and MoE architectures.
- Hyper-connections help reduce training instabilities. This is clear from Figures 5 and 6: the training curves of the models with hyper-connections are smoother and do not have spikes, which is a major advan
- The main concern I have with this paper is the computational impact of replicating the activations of the network $n$ times for hyper-connections. There is no study of the computational impact in terms of either running time or memory usage. The authors state at Line 394 that “Both methods expand the hidden size by n times with negligible computational overhead”, but this is not shown with a proper experiment on throughput, overall running time, and peak memory usage. Also, it seems that n=1 pe
1. The results on LLM benchmarks and losses suggest a better balance between vanishing gradients and representation collapse.
2. Section 4.5 discusses the effect of hyper-connections, showing that hyper-connections eliminate input embeddings from the output and form parallel blocks that rely less on each other, increasing the chances of unique representations.
3. Parallel block formation is particularly important, as similar layers in a transformer block tend to learn similar represent
1. The main drawback is that creating $n$ copies leads to a considerable increase in memory; although the burden can be reduced through engineering, the impact is not yet known.
2. If the goal of creating multiple copies is just to make sure multiple depth connections can be modelled in parallel, is creating such copies actually necessary? Can't a single copy be used with different residual strengths? The only difference would be in gradient computation, where additional terms for
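The memory concern raised in the reviews above can be made concrete with a back-of-the-envelope estimate. The sketch below counts only the hidden-state tensor passed between layers and assumes it is stored once per copy; the function name, the shapes, and the 2-byte element size are hypothetical choices for illustration. Actual peak memory depends on the implementation (for example, recomputation) and on all the other activations a layer stores, which this estimate ignores.

```python
def hidden_stream_bytes(batch, seq_len, hidden, n_copies=1, bytes_per_elem=2):
    """Bytes needed to hold n_copies of a (batch, seq_len, hidden) hidden state.

    Rough estimate only: ignores attention/FFN intermediates, optimizer state,
    and any memory-saving engineering.
    """
    return n_copies * batch * seq_len * hidden * bytes_per_elem


# Hypothetical shapes, chosen only for illustration.
baseline = hidden_stream_bytes(batch=8, seq_len=2048, hidden=4096)
expanded = hidden_stream_bytes(batch=8, seq_len=2048, hidden=4096, n_copies=4)

print(baseline)             # 134217728 bytes, i.e. 128 MiB
print(expanded / baseline)  # 4.0
```

Under these assumptions the hidden-state storage grows linearly in $n$, which is why the reviewers ask for measured throughput and peak-memory numbers rather than an estimate of this kind.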
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications
Methods: Residual Connection
