Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning
Bei Li, Tong Zheng, Rui Wang, Jiahao Liu, Qingyan Guo, Junliang Guo, Xu Tan, Tong Xiao, Jingbo Zhu, Jingang Wang, Xunliang Cai

TL;DR
This paper introduces a predictor-corrector framework with exponential moving average coefficient learning to improve Transformer models, achieving state-of-the-art results on multiple NLP benchmarks with fewer parameters.
Contribution
The work presents a novel predictor-corrector learning scheme combined with EMA-based coefficient learning to enhance Transformer accuracy and efficiency.
Findings
Achieved BLEU scores of 30.95 and 44.27 on the WMT'14 English-German and English-French tasks, respectively.
Surpassed a 3.8B DeepNet by 2.9 BLEU points using only one-third of the parameters.
Outperformed LLaMA models by 5.7 points on the LM Harness Evaluation.
Abstract
Residual networks, as discrete approximations of Ordinary Differential Equations (ODEs), have inspired significant advancements in neural network design, including multistep methods, high-order methods, and multi-particle dynamical systems. The precision of the solution to ODEs significantly affects parameter optimization, thereby impacting model performance. In this work, we present a series of advanced explorations of Transformer architecture design to minimize the error compared to the true "solution." First, we introduce a predictor-corrector learning framework to minimize truncation errors, which consists of a high-order predictor and a multistep corrector. Second, we propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor. Extensive experiments on large-scale machine translation, abstractive summarization, language…
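The predictor-corrector idea in the abstract can be illustrated with a classical numerical-analysis sketch. The example below is not the paper's architecture: it assumes a toy scalar ODE dy/dt = -y and uses Heun's method (an Euler predictor followed by a trapezoidal corrector) in place of the high-order predictor and multistep corrector the authors describe. The names `f`, `euler_step`, `predictor_corrector_step`, and `integrate` are hypothetical and chosen only for illustration. The plain Euler update plays the role of a standard residual connection, and the comparison shows how a correction step can shrink the truncation error at the same step size.

```python
import math


def f(y):
    # Right-hand side of the toy ODE dy/dt = -y (exact solution: exp(-t)).
    return -y


def euler_step(y, h):
    # First-order update y + h * f(y); structurally the same as a residual block.
    return y + h * f(y)


def predictor_corrector_step(y, h):
    # Predictor: explicit Euler estimate of the next state.
    y_pred = y + h * f(y)
    # Corrector: trapezoidal rule evaluated at the predicted state (Heun's method).
    return y + 0.5 * h * (f(y) + f(y_pred))


def integrate(step, y0=1.0, h=0.1, n_steps=10):
    # Apply the chosen one-step rule n_steps times, starting from y0.
    y = y0
    for _ in range(n_steps):
        y = step(y, h)
    return y


exact = math.exp(-1.0)  # true solution at t = 1
euler_error = abs(integrate(euler_step) - exact)
pc_error = abs(integrate(predictor_corrector_step) - exact)

print(f"Euler error:               {euler_error:.5f}")
print(f"Predictor-corrector error: {pc_error:.5f}")
```

In this toy setting the corrected update costs one extra evaluation of `f` per step. The paper's premise, as stated in the abstract, is that reducing truncation error in this way carries over to the layer-wise updates of a Transformer.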
Taxonomy
Topics: Neural Networks and Applications · Machine Learning and ELM · Blind Source Separation Techniques
Methods: Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Dropout · Absolute Position Encodings
