Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning

Bei Li; Tong Zheng; Rui Wang; Jiahao Liu; Qingyan Guo; Junliang Guo; Xu Tan; Tong Xiao; Jingbo Zhu; Jingang Wang; Xunliang Cai

arXiv:2411.03042·cs.CL·November 6, 2024

TL;DR

This paper introduces a predictor-corrector framework with exponential moving average coefficient learning to improve Transformer models, achieving state-of-the-art results on multiple NLP benchmarks with fewer parameters.

Contribution

The work presents a novel predictor-corrector learning scheme combined with EMA-based coefficient learning to enhance Transformer accuracy and efficiency.

Findings

01

Achieved BLEU scores of 30.95 and 44.27 on the WMT'14 English-German and English-French tasks, respectively.

02

Surpassed a 3.8B DeepNet by 2.9 BLEU points using only one-third of the parameters.

03

Outperformed LLaMA models by 5.7 points on the LM Harness Evaluation.

Abstract

Residual networks, as discrete approximations of Ordinary Differential Equations (ODEs), have inspired significant advancements in neural network design, including multistep methods, high-order methods, and multi-particle dynamical systems. The precision of the solution to ODEs significantly affects parameter optimization, thereby impacting model performance. In this work, we present a series of advanced explorations of Transformer architecture design to minimize the error compared to the true "solution." First, we introduce a predictor-corrector learning framework to minimize truncation errors, which consists of a high-order predictor and a multistep corrector. Second, we propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor. Extensive experiments on large-scale machine translation, abstractive summarization, language…
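
The predictor-corrector idea in the abstract can be illustrated on a toy scalar ODE. The sketch below is not the authors' Transformer architecture: it is a classical numerical analogue, assuming a two-stage explicit predictor whose stage weights decay geometrically (an EMA-style form chosen here for illustration; the paper's exact parameterization is not given on this page) and a two-step Adams-Moulton corrector. All function names (ema_weights, predict, solve, euler) are hypothetical.

```python
import math


def ema_weights(k, alpha):
    """Normalized geometric weights over k stages; the newest stage gets the most weight.

    Assumed form for illustration only, not taken from the paper.
    """
    raw = [alpha * (1.0 - alpha) ** (k - 1 - i) for i in range(k)]
    total = sum(raw)
    return [w / total for w in raw]


def predict(f, y, h, weights):
    """Explicit multi-stage predictor: each stage evaluates f at a trial point
    built from the previous stage, then the stages are mixed by `weights`."""
    stages = []
    trial = y
    for _ in weights:
        stages.append(f(trial))
        trial = y + h * stages[-1]
    return y + h * sum(w * s for w, s in zip(weights, stages))


def solve(f, y0, h, n_steps, alpha=0.5, k=2):
    """Predict with the weighted explicit scheme, then correct with a multistep rule."""
    weights = ema_weights(k, alpha)
    y, f_prev = y0, None
    for _ in range(n_steps):
        f_curr = f(y)
        f_pred = f(predict(f, y, h, weights))
        if f_prev is None:
            # No history yet: fall back to a trapezoidal correction.
            y_next = y + 0.5 * h * (f_pred + f_curr)
        else:
            # Two-step Adams-Moulton correction using the previous derivative.
            y_next = y + h * (5.0 * f_pred + 8.0 * f_curr - f_prev) / 12.0
        y, f_prev = y_next, f_curr
    return y


def euler(f, y0, h, n_steps):
    """Plain forward Euler, the ODE view of a vanilla residual update."""
    y = y0
    for _ in range(n_steps):
        y = y + h * f(y)
    return y


def decay(y):
    """Right-hand side of dy/dt = -y, whose exact solution is exp(-t)."""
    return -y


exact = math.exp(-1.0)
pc_error = abs(solve(decay, 1.0, 0.1, 10) - exact)
euler_error = abs(euler(decay, 1.0, 0.1, 10) - exact)
print(f"predictor-corrector error: {pc_error:.2e}, Euler error: {euler_error:.2e}")
```

On this toy problem the corrected scheme lands much closer to the exact value than forward Euler at the same step size, which is the sense in which a more precise ODE solver reduces truncation error.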

Videos

Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Machine Learning and ELM · Blind Source Separation Techniques

Methods: Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Dropout · Absolute Position Encodings