The Evolved Transformer
David R. So, Chen Liang, Quoc V. Le

TL;DR
This paper uses neural architecture search to evolve a new Transformer variant that improves on the original Transformer in both quality and efficiency across four language tasks.
Contribution
It introduces the Evolved Transformer, an architecture found via evolutionary NAS using the new Progressive Dynamic Hurdles search method, which outperforms the original Transformer on key benchmarks.
Findings
At a big model size, the Evolved Transformer achieves a state-of-the-art BLEU score of 29.8 on WMT'14 English-German.
At a mobile-friendly size of about 7M parameters, it outperforms the original Transformer by 0.7 BLEU.
At smaller sizes, it matches the quality of the original "big" Transformer with 37.6% fewer parameters.
Abstract
Recent works have highlighted the strength of the Transformer architecture on sequence tasks while, at the same time, neural architecture search (NAS) has begun to outperform human-designed models. Our goal is to apply NAS to search for a better alternative to the Transformer. We first construct a large search space inspired by the recent advances in feed-forward sequence models and then run evolutionary architecture search with warm starting by seeding our initial population with the Transformer. To directly search on the computationally expensive WMT 2014 English-German translation task, we develop the Progressive Dynamic Hurdles method, which allows us to dynamically allocate more resources to more promising candidate models. The architecture found in our experiments -- the Evolved Transformer -- demonstrates consistent improvement over the Transformer on four well-established…
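The resource-allocation idea behind Progressive Dynamic Hurdles can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each candidate is trained in stages, that a hurdle is the mean fitness of the current population, and that a candidate only receives its next block of training steps if it clears the current hurdle. The names `evaluate_with_hurdles`, `make_hurdle`, `train_and_score`, and `step_increments` are hypothetical and chosen for illustration.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple


def make_hurdle(population_fitness: Sequence[float]) -> float:
    """Assumed hurdle rule: the mean fitness of the current population."""
    return mean(population_fitness)


def evaluate_with_hurdles(
    train_and_score: Callable[[int], float],
    hurdles: Sequence[float],
    step_increments: Sequence[int],
) -> Tuple[float, int]:
    """Train a candidate in stages and stop early if it fails a hurdle.

    train_and_score(total_steps) is a hypothetical callback that trains the
    candidate up to total_steps and returns its fitness (higher is better).
    Returns the last measured fitness and the total steps spent.
    """
    if len(step_increments) != len(hurdles) + 1:
        raise ValueError("need exactly one more step increment than hurdles")

    total_steps = 0
    fitness = float("-inf")
    for stage, extra_steps in enumerate(step_increments):
        total_steps += extra_steps
        fitness = train_and_score(total_steps)
        # Candidates below the current hurdle receive no further resources.
        if stage < len(hurdles) and fitness < hurdles[stage]:
            break
    return fitness, total_steps


# Toy usage: a weak candidate is cut off after the first stage, while a
# strong one is trained through every stage.
weak_result = evaluate_with_hurdles(lambda steps: 0.2, [0.5, 0.7], [100, 100, 100])
strong_result = evaluate_with_hurdles(lambda steps: 0.9, [0.5, 0.7], [100, 100, 100])
```

In this toy run the weak candidate consumes only the first block of steps, so the remaining budget goes to candidates that keep clearing hurdles, which is the behaviour the abstract describes as dynamically allocating more resources to more promising models.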
Taxonomy
Topics: Magnetic Properties and Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing
