Fast Neural Machine Translation Implementation

Hieu Hoang; Tomasz Dwojak; Rihards Krislauks; Daniel Torregrosa; Kenneth Heafield

arXiv:1805.09863 · cs.CL · June 11, 2018

TL;DR

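This paper describes GPU efficiency-track submissions built on Amun, a fast inference engine for neural machine translation. An efficient mini-batching algorithm and the fusion of the softmax operation with k-best extraction improve performance; submissions using Amun were the first, second and third fastest in the GPU efficiency track.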

Contribution

The paper contributes two implementation techniques for GPU inference with the recurrent neural machine translation model in Amun: an efficient mini-batching algorithm, and the fusion of the softmax operation with the k-best extraction algorithm.

Findings

01

Submissions using Amun were the first, second and third fastest in the GPU efficiency track.

02

An efficient mini-batching algorithm improves performance.

03

Fusing the softmax operation with the k-best extraction algorithm improves performance.

Abstract

This paper describes the submissions to the efficiency track for GPUs at the Workshop for Neural Machine Translation and Generation by members of the University of Edinburgh, Adam Mickiewicz University, Tilde and University of Alicante. We focus on efficient implementation of the recurrent deep-learning model as implemented in Amun, the fast inference engine for neural machine translation. We improve the performance with an efficient mini-batching algorithm, and by fusing the softmax operation with the k-best extraction algorithm. Submissions using Amun were first, second and third fastest in the GPU efficiency track.
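
The idea of fusing the softmax with k-best extraction can be illustrated with a small sketch. This is not the authors' GPU kernel; it is a minimal single-threaded Python illustration, assuming the input is a non-empty list of finite logits for one beam hypothesis. It relies on the fact that softmax preserves ordering, so the k best entries can be selected on the raw logits while the normalizer is accumulated in the same pass, and log-probabilities are then computed only for the k winners. The function name `fused_softmax_kbest` is hypothetical.

```python
import heapq
import math


def fused_softmax_kbest(logits, k):
    """Return the k best (index, log_prob) pairs in one pass over the logits.

    Illustrative sketch only: assumes a non-empty list of finite floats.
    """
    running_max = float("-inf")
    running_sum = 0.0  # sum of exp(x - running_max) over the values seen so far
    heap = []          # min-heap of (logit, index) holding at most k entries

    for i, x in enumerate(logits):
        # Online softmax normalizer: rescale the sum whenever the max grows.
        if x > running_max:
            running_sum = running_sum * math.exp(running_max - x) + 1.0
            running_max = x
        else:
            running_sum += math.exp(x - running_max)

        # k-best selection on raw logits (softmax is monotonic).
        if len(heap) < k:
            heapq.heappush(heap, (x, i))
        elif x > heap[0][0]:
            heapq.heapreplace(heap, (x, i))

    log_z = running_max + math.log(running_sum)
    best = sorted(heap, reverse=True)
    # Normalize only the k selected entries instead of the whole vocabulary.
    return [(i, x - log_z) for x, i in best]


if __name__ == "__main__":
    for index, log_prob in fused_softmax_kbest([1.0, 3.0, 2.0, 0.5], k=2):
        print(index, log_prob)
```

The point of the sketch is that the full probability vector is never materialized: one pass produces both the normalizer and the candidates, which is the kind of saving that fusion aims for on a large output vocabulary.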

Taxonomy

Methods: Adam · Softmax