ResiDual: Transformer with Dual Residual Connections
Shufang Xie, Huishuai Zhang, Junliang Guo, Xu Tan, Jiang Bian, Hany Hassan Awadalla, Arul Menezes, Tao Qin, Rui Yan

TL;DR
ResiDual introduces a novel Transformer architecture that combines the advantages of Post-LN and Pre-LN residual connections, effectively addressing their individual limitations and improving training stability and model capacity.
Contribution
The paper proposes ResiDual, a new Transformer design with a Pre-Post-LN structure that unites the benefits of existing residual connection variants while avoiding their drawbacks.
Findings
ResiDual reduces gradient vanishing issues in deep Transformers.
ResiDual prevents representation collapse, enhancing model capacity.
Empirical results show ResiDual outperforms Post-LN and Pre-LN on translation benchmarks.
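The representation-collapse finding can be probed with a small numeric sketch. It is not the authors' experiment: the sublayer below is a hypothetical stand-in (a random linear map followed by tanh), and the dimensions, depth, and seed are arbitrary assumptions. It only illustrates the mechanism usually given for Pre-LN: the skip path is never normalized, so it keeps accumulating while each sublayer's output stays bounded, and the relative change per layer tends to shrink with depth.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in x))

def make_sublayer(dim, rng):
    """Hypothetical stand-in for an attention/FFN sublayer: random linear map + tanh."""
    w = [[rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)] for _ in range(dim)]
    def f(x):
        return [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in w]
    return f

rng = random.Random(0)   # arbitrary seed, for reproducibility only
dim, depth = 16, 40      # arbitrary toy sizes
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]

relative_change = []
for _ in range(depth):
    f = make_sublayer(dim, rng)
    update = f(layer_norm(x))                  # Pre-LN: the sublayer sees the normalized input
    relative_change.append(norm(update) / norm(x))
    x = [u + v for u, v in zip(x, update)]     # the skip path itself is never normalized

early = sum(relative_change[:5]) / 5
late = sum(relative_change[-5:]) / 5
print(f"mean relative change, first 5 layers: {early:.3f}")
print(f"mean relative change, last 5 layers:  {late:.3f}")
```

In this toy setting the later layers move the hidden state proportionally less than the early ones, which matches the qualitative description of representation collapse. It says nothing about the magnitude of the effect in a trained Transformer.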
Abstract
Transformer networks have become the preferred architecture for many tasks due to their state-of-the-art performance. However, the optimal way to implement residual connections in Transformer, which are essential for effective training, is still debated. Two widely used variants are the Post-Layer-Normalization (Post-LN) and Pre-Layer-Normalization (Pre-LN) Transformers, which apply layer normalization after each residual block's output or before each residual block's input, respectively. While both variants enjoy their advantages, they also suffer from severe limitations: Post-LN causes gradient vanishing issue that hinders training deep Transformers, and Pre-LN causes representation collapse issue that limits model capacity. In this paper, we propose ResiDual, a novel Transformer architecture with Pre-Post-LN (PPLN), which fuses the connections in Post-LN and Pre-LN together and…
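The difference between the two variants described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: `make_sublayer` is a hypothetical stand-in for an attention or feed-forward sublayer, `layer_norm` omits the learned gain and bias, and the final layer norm in `pre_ln_stack` follows common practice rather than anything stated here. The `dual_residual_stack` function is an assumption about how the two connections might be fused (the abstract is truncated before the details): one stream is updated Post-LN style, a second unnormalized stream accumulates the same sublayer outputs, and the two are combined at the end.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [u + v for u, v in zip(a, b)]

def make_sublayer(dim, rng):
    """Hypothetical stand-in for an attention/FFN sublayer: random linear map + tanh."""
    w = [[rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)] for _ in range(dim)]
    def f(x):
        return [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in w]
    return f

def post_ln_stack(x, sublayers):
    # Post-LN: layer norm is applied after each residual block's output.
    for f in sublayers:
        x = layer_norm(add(x, f(x)))
    return x

def pre_ln_stack(x, sublayers):
    # Pre-LN: layer norm is applied to each residual block's input;
    # the skip path itself stays unnormalized.
    for f in sublayers:
        x = add(x, f(layer_norm(x)))
    return layer_norm(x)  # final norm: common practice, assumed here

def dual_residual_stack(x, sublayers):
    # ASSUMPTION: an illustrative fusion of the two connections, not confirmed by the text above.
    x_ln, x_dual = x, x
    for f in sublayers:
        out = f(x_ln)
        x_ln = layer_norm(add(x_ln, out))   # Post-LN style stream
        x_dual = add(x_dual, out)           # unnormalized accumulating stream
    return add(x_ln, layer_norm(x_dual))

rng = random.Random(0)   # arbitrary seed, for reproducibility only
dim, depth = 8, 6        # arbitrary toy sizes
sublayers = [make_sublayer(dim, rng) for _ in range(depth)]
x0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]

y_post = post_ln_stack(x0, sublayers)
y_pre = pre_ln_stack(x0, sublayers)
y_dual = dual_residual_stack(x0, sublayers)
```

The only structural difference between the first two functions is where `layer_norm` sits relative to the addition, which is exactly the distinction the abstract draws.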
Peer Reviews
Decision·Submitted to ICLR 2024
+ The paper is well-organized, with a clear introduction to the problem, a detailed methodology, and a presentation of results that make it accessible to readers.
+ The paper introduces a novel architecture, ResiDual, which creatively combines the benefits of Post-LN and Pre-LN Transformers to address their respective limitations.
+ The submission includes a thorough theoretical examination of the gradient vanishing and representation collapse problems, providing a solid foundation for the propo
- Experiments appear to be restricted to machine translation tasks, which does not demonstrate the model's generalizability across other domains or tasks, in NLP or in other areas where Transformers are applicable.
- Comparisons with models that apply similar enhancements to the Transformer architecture (e.g. RealFormer) are missing, leaving the evaluation incomplete.
- An analysis of the computational cost or training efficiency of the ResiDual model is missing.
Approach: The introduction of the ResiDual model offers a fresh perspective on addressing the challenges faced by the Transformer architecture.
Comprehensive Experiments: The paper provides extensive experimental results on multiple datasets, showcasing the model's robustness and versatility.
Performance: The ResiDual model shows some improvements over other methods.
Stability in Training: The research highlights that the ResiDual model does not require learning-rate warm-up for converge
I'm not an expert in NLP and have limited knowledge of this area. However, one of my concerns is the incremental performance gain. Also, the authors claim that "Post-LN causes gradient vanishing issue that hinders training deep Transformers, and Pre-LN causes representation collapse issue that limits model capacity". Thanks for the theoretical analysis in Sec. 2, but could the authors provide more empirical evidence to support this claim?
1. The authors' method avoids the "representation collapse" issue, where the representation changes less with each new layer in Pre-LN.
2. The authors evaluate their method on multiple datasets.
1. The treatment of the gradient norm of Pre-LN sometimes seems to be empirically incorrect (see Questions for authors), with observations in practice that differ wildly from the authors' theoretical treatment.
2. Their derivation of Theorem 3.1 ignores the impact of the non-linearity on the gradient of the Transformer MLP layer, instead assuming it to be a single FC layer.
3. The derivation also ignores back-propagated gradient through the Query - while Keys will not back-propagate any gradient as Query is z
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Data Classification
Methods: Multi-Head Attention · Attention Is All You Need · Adam · Linear Layer · Label Smoothing · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Absolute Position Encodings
