Fast Vocabulary Projection Method via Clustering for Multilingual Machine Translation on GPU

Hossam Amer; Young Jin Kim; Mohamed Afify; Hitokazu Matsushita; Hany Hassan Awadallah

arXiv:2208.06874·cs.CL·August 16, 2022

Open Access

TL;DR

This paper introduces a clustering-based vocabulary projection method for multilingual neural machine translation on GPUs, significantly improving inference speed while maintaining translation quality.

Contribution

The paper proposes a clustering approach that reduces the number of vocabulary columns scored during projection, enabling faster GPU inference in multilingual transformers.

Findings

01. Speeds up vocab projection by up to 2.6x

02. Achieves 25% end-to-end inference speed gain on GPU

03. Maintains BLEU scores and translation quality

Abstract

Multilingual Neural Machine Translation has been showing great success using transformer models. Deploying these models is challenging because they usually require large vocabulary (vocab) sizes for various languages. This limits the speed of predicting the output tokens in the last vocab projection layer. To alleviate these challenges, this paper proposes a fast vocabulary projection method via clustering which can be used for multilingual transformers on GPUs. First, we offline split the vocab search space into disjoint clusters given the hidden context vector of the decoder output, which results in much smaller vocab columns for vocab projection. Second, at inference time, the proposed method predicts the clusters and candidate active tokens for hidden context vectors at the vocab projection. This paper also includes analysis of different ways of building these clusters in…
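The two steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain k-means over sample hidden vectors for the offline step, and it assumes each cluster's candidate tokens are the tokens the full projection picks for the samples in that cluster. The function names (`build_clusters`, `clustered_argmax`) and the toy weight matrix `W` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest(h, centroids):
    # Index of the centroid closest to h (squared Euclidean distance).
    return min(range(len(centroids)),
               key=lambda c: sum((x - y) ** 2 for x, y in zip(h, centroids[c])))

def full_argmax(h, W):
    # Baseline: score every vocab row of W against the hidden vector h.
    return max(range(len(W)), key=lambda t: dot(W[t], h))

def kmeans(points, k, iters=10):
    # Assumption: Lloyd's algorithm seeded with the first k points.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for c, members in groups.items():
            centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_clusters(samples, W, k):
    # Offline step: cluster sample hidden vectors, then record which tokens
    # the full projection selects for the vectors assigned to each cluster.
    centroids = kmeans(samples, k)
    candidates = defaultdict(set)
    for h in samples:
        candidates[nearest(h, centroids)].add(full_argmax(h, W))
    return centroids, {c: sorted(toks) for c, toks in candidates.items()}

def clustered_argmax(h, W, centroids, candidates):
    # Inference step: predict the cluster, then score only its candidate
    # tokens. Falls back to the full vocab if the cluster has no candidates.
    toks = candidates.get(nearest(h, centroids)) or range(len(W))
    return max(toks, key=lambda t: dot(W[t], h))

# Toy data: 6 vocab tokens with 2-dimensional embeddings (hypothetical).
W = [[1.0, 0.0], [0.9, 0.5], [0.0, 1.0], [0.5, 0.9], [-1.0, 0.0], [0.0, -1.0]]
samples = [[2.0, 0.1], [0.1, 2.0], [1.8, -0.2],
           [-0.1, 1.9], [1.5, 0.4], [0.6, 1.5]]

centroids, candidates = build_clusters(samples, W, k=2)
h = [1.0, 0.6]
print(candidates)
print(clustered_argmax(h, W, centroids, candidates), full_argmax(h, W))
```

In this toy setup each cluster keeps only two of the six tokens, so the inference step scores a third of the vocabulary rows. The clustered result matches the full projection only when the true best token is among the predicted cluster's candidates, which is why the way clusters are built matters for translation quality.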

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications