TL;DR
This paper introduces a streaming batching strategy for variable-length decoding on GPUs: the batch is refilled as candidates terminate or are pruned, which cuts runtime while matching baseline translation quality. The approach also speeds up decoding in other NLP tasks.
Contribution
It presents a streaming batching method for variable-length decoding on GPUs. Applied to variable-width beam search in machine translation, the method speeds up decoding while matching baseline BLEU, and it also speeds up decoding in semantic and syntactic parsing.
Findings
Reduces decoding runtime by up to 71% compared to a fixed-width beam search baseline
Reduces decoding runtime by 17% compared to a variable-width beam search baseline while matching the baselines' BLEU
Speeds up decoding in semantic and syntactic parsing tasks
Abstract
We propose an efficient batching strategy for variable-length decoding on GPU architectures. During decoding, when candidates terminate or are pruned according to heuristics, our streaming approach periodically "refills" the batch before proceeding with a selected subset of candidates. We apply our method to variable-width beam search on a state-of-the-art machine translation model. Our method decreases runtime by up to 71% compared to a fixed-width beam search baseline and 17% compared to a variable-width baseline, while matching baselines' BLEU. Finally, experiments show that our method can speed up decoding in other domains, such as semantic and syntactic parsing.
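The refill idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it uses greedy decoding rather than variable-width beam search, and the names `stream_decode`, `step_fn`, `refill_threshold`, and `toy_step` are hypothetical, introduced only for illustration. It assumes a batched step function that returns one `(token, done)` pair per active candidate.

```python
from collections import deque


def stream_decode(inputs, step_fn, batch_size, refill_threshold, max_len=50):
    """Decode every input, topping up the active batch from a queue.

    Illustrative assumption: step_fn takes a list of (source, tokens_so_far)
    pairs and returns one (next_token, done) pair per entry.
    Returns (outputs in input order, number of batched steps taken).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    pending = deque(enumerate(inputs))  # inputs not yet started
    active = []                         # (input_id, source, tokens_so_far)
    outputs = {}
    steps = 0

    while pending or active:
        # Refill only once enough candidates have finished, so the batch
        # is not rebuilt at every single step.
        if len(active) <= refill_threshold:
            while pending and len(active) < batch_size:
                idx, src = pending.popleft()
                active.append((idx, src, []))

        # One batched decoding step over all active candidates.
        results = step_fn([(src, toks) for _, src, toks in active])
        steps += 1

        still_active = []
        for (idx, src, toks), (tok, done) in zip(active, results):
            toks = toks + [tok]
            if done or len(toks) >= max_len:
                outputs[idx] = toks  # candidate terminated; frees a slot
            else:
                still_active.append((idx, src, toks))
        active = still_active

    return [outputs[i] for i in range(len(inputs))], steps


def toy_step(batch):
    """Toy stand-in for a model: each source is its own target length."""
    return [(len(toks), len(toks) + 1 >= src) for src, toks in batch]


if __name__ == "__main__":
    decoded, n_steps = stream_decode([3, 1, 2, 5], toy_step,
                                     batch_size=2, refill_threshold=1)
    print(decoded, n_steps)
```

In this sketch, `refill_threshold` controls how often the batch is rebuilt: a low value refills rarely and lets the batch shrink, while a value close to `batch_size` keeps the batch nearly full at the cost of more frequent refills.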
