Fast Vocabulary Projection Method via Clustering for Multilingual Machine Translation on GPU
Hossam Amer, Young Jin Kim, Mohamed Afify, Hitokazu Matsushita, Hany Hassan Awadallah

TL;DR
This paper introduces a clustering-based vocabulary projection method for multilingual neural machine translation on GPUs. It yields an end-to-end inference speed gain of up to 25% while maintaining BLEU scores.
Contribution
The paper proposes a clustering approach that shrinks the set of vocab columns scored in the final projection layer, enabling faster GPU inference in multilingual transformers.
Findings
Speeds up the vocab projection step by up to 2.6x
Achieves an end-to-end inference speed gain of up to 25% on GPU
Maintains BLEU scores and translation quality
Abstract
Multilingual Neural Machine Translation has shown great success using transformer models. Deploying these models is challenging because they usually require large vocabulary (vocab) sizes to cover many languages. This limits the speed of predicting the output tokens in the last vocab projection layer. To alleviate these challenges, this paper proposes a fast vocabulary projection method via clustering which can be used for multilingual transformers on GPUs. First, we split the vocab search space offline into disjoint clusters given the hidden context vector of the decoder output, which results in much smaller vocab columns for vocab projection. Second, at inference time, the proposed method predicts the clusters and candidate active tokens for hidden context vectors at the vocab projection. This paper also includes analysis of different ways of building these clusters in…
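The two steps described in the abstract (offline clustering, then cluster and candidate-token prediction at inference) can be illustrated with a small sketch. This is not the authors' implementation: it assumes k-means over sampled decoder hidden vectors, and it assumes each cluster's active tokens are the tokens the full projection picks for that cluster's samples. The names `build_clusters`, `clustered_projection_argmax`, and `full_projection_argmax` are hypothetical, and pure Python lists stand in for GPU tensors. In this simplified version the hidden-vector clusters are disjoint, but the per-cluster token sets may overlap.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest(centroids, h):
    # Index of the centroid with the smallest squared Euclidean distance to h.
    return min(range(len(centroids)),
               key=lambda c: sum((x - y) ** 2 for x, y in zip(centroids[c], h)))

def full_projection_argmax(W, h):
    # Baseline: score every vocab row of W against the hidden vector h.
    return max(range(len(W)), key=lambda t: dot(W[t], h))

def build_clusters(W, hidden_samples, k, iters=10):
    # Offline step (assumption): k-means over sampled hidden vectors,
    # seeded deterministically with the first k samples.
    centroids = [list(h) for h in hidden_samples[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for h in hidden_samples:
            groups[nearest(centroids, h)].append(h)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Active tokens per cluster (assumption): tokens chosen by the full
    # projection for the samples assigned to that cluster.
    active = [set() for _ in range(k)]
    for h in hidden_samples:
        active[nearest(centroids, h)].add(full_projection_argmax(W, h))
    return centroids, [sorted(s) for s in active]

def clustered_projection_argmax(W, centroids, active, h):
    # Inference step: predict the cluster, then score only its active tokens.
    c = nearest(centroids, h)
    candidates = active[c] or range(len(W))  # fall back to the full vocab
    return max(candidates, key=lambda t: dot(W[t], h))

# Toy usage: 4 vocab rows in a 2-dimensional hidden space.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
samples = [[2.0, 0.1], [0.1, 2.0], [1.9, -0.1], [-0.1, 2.1]]
centroids, active = build_clusters(W, samples, k=2)
h = [3.0, 0.5]
print(clustered_projection_argmax(W, centroids, active, h),
      full_projection_argmax(W, h))
```

The saving comes from the last function: it scores only the active rows of `W` for the predicted cluster instead of all rows. A hidden vector whose true best token is missing from its cluster's active set would be mispredicted, so the way the clusters and active sets are built determines how well quality is preserved.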
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
