Fast Neural Machine Translation Implementation
Hieu Hoang, Tomasz Dwojak, Rihards Krislauks, Daniel Torregrosa, Kenneth Heafield

TL;DR
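This paper describes GPU submissions to the efficiency track of the Workshop for Neural Machine Translation and Generation, built on Amun, a fast inference engine for neural machine translation. An efficient mini-batching algorithm and a softmax fused with k-best extraction improve performance; submissions using Amun were the first, second and third fastest in the GPU efficiency track.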
Contribution
The paper speeds up GPU inference for the recurrent neural machine translation model implemented in Amun with two implementation techniques: an efficient mini-batching algorithm, and fusion of the softmax operation with the k-best extraction algorithm.
Findings
Submissions using Amun were the first, second and third fastest in the GPU efficiency track.
An efficient mini-batching algorithm improves performance.
Fusing the softmax operation with k-best extraction improves performance.
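To make the mini-batching point concrete, here is a minimal Python sketch of one common way batching choices affect the amount of work done: grouping sentences of similar length so that less padding is processed. This is a generic illustration and not the algorithm used in Amun, which this summary does not describe; the function names `make_batches` and `padded_tokens` are hypothetical.

```python
def make_batches(sentences, batch_size):
    """Group tokenized sentences into batches of similar length.

    Returns a list of batches, each a list of indices into `sentences`.
    Assumption: sorting by length is a generic padding-reduction trick,
    not necessarily what the paper's mini-batching algorithm does.
    """
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    return [order[s:s + batch_size] for s in range(0, len(order), batch_size)]


def padded_tokens(sentences, batches):
    """Count token slots processed when each batch is padded to its longest sentence."""
    return sum(len(b) * max(len(sentences[i]) for i in b) for b in batches)


sentences = [["a"], ["a", "b", "c", "d", "e"], ["b"], ["f", "g", "h", "i", "j"]]
in_order = [[0, 1], [2, 3]]               # batches formed in arrival order
by_length = make_batches(sentences, 2)    # batches formed after sorting by length
```

With these four toy sentences, comparing `padded_tokens(sentences, in_order)` against `padded_tokens(sentences, by_length)` shows how many padded slots the length-sorted grouping avoids.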
Abstract
This paper describes the submissions to the efficiency track for GPUs at the Workshop for Neural Machine Translation and Generation by members of the University of Edinburgh, Adam Mickiewicz University, Tilde and University of Alicante. We focus on efficient implementation of the recurrent deep-learning model as implemented in Amun, the fast inference engine for neural machine translation. We improve the performance with an efficient mini-batching algorithm, and by fusing the softmax operation with the k-best extraction algorithm. Submissions using Amun were first, second and third fastest in the GPU efficiency track.
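The idea of fusing the softmax with k-best extraction can be sketched as follows. Because softmax preserves the ordering of the logits, the k best entries can be selected on the raw scores while the normalization constant is accumulated in the same pass, so the full probability vector is never materialized. The code below is a minimal single-threaded Python sketch under that assumption; it is not Amun's GPU kernel, and the name `fused_softmax_kbest` is hypothetical.

```python
import heapq
import math


def fused_softmax_kbest(logits, k):
    """Return the k best (log-probability, index) pairs in one pass over logits.

    Illustrative sketch only: a running maximum and running sum give a
    numerically stable log-normalizer, while a size-k min-heap tracks the
    best raw scores seen so far.
    """
    if not logits or k < 1:
        raise ValueError("need at least one logit and k >= 1")
    running_max = float("-inf")
    running_sum = 0.0   # sum of exp(x - running_max) over the logits seen so far
    heap = []           # min-heap of (logit, index), holding at most k items
    for i, x in enumerate(logits):
        if x > running_max:
            # Rescale the accumulated sum to the new maximum, then add exp(0).
            running_sum = running_sum * math.exp(running_max - x) + 1.0
            running_max = x
        else:
            running_sum += math.exp(x - running_max)
        if len(heap) < k:
            heapq.heappush(heap, (x, i))
        elif x > heap[0][0]:
            heapq.heapreplace(heap, (x, i))
    log_z = running_max + math.log(running_sum)
    # Only the k survivors are normalized, best first.
    return [(x - log_z, i) for x, i in sorted(heap, reverse=True)]
```

For example, `fused_softmax_kbest([1.0, 3.0, 2.0, 0.5], 2)` selects indices 1 and 2 and returns their log-probabilities. In beam search, k would typically be the beam size and `logits` the output-layer scores over the vocabulary.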
Taxonomy
Methods: Adam · Softmax
