A Streaming Approach For Efficient Batched Beam Search

Kevin Yang; Violet Yao; John DeNero; Dan Klein

arXiv:2010.02164·cs.CL·August 17, 2021


TL;DR

This paper introduces a streaming batching strategy for variable-length decoding on GPUs. Applied to variable-width beam search in machine translation, it cuts runtime substantially while matching the baselines' BLEU, and it also speeds up decoding in semantic and syntactic parsing.

Contribution

It presents a streaming batching method for GPU decoding that refills the batch as candidates terminate or are pruned, improving decoding speed in machine translation without lowering BLEU and also speeding up decoding in parsing tasks.

Findings

01. Reduces decoding runtime by up to 71% compared to a fixed-width beam search baseline

02. Reduces decoding runtime by up to 17% compared to a variable-width beam search baseline, while matching the baselines' BLEU scores

03. Speeds up decoding in semantic and syntactic parsing tasks

Abstract

We propose an efficient batching strategy for variable-length decoding on GPU architectures. During decoding, when candidates terminate or are pruned according to heuristics, our streaming approach periodically "refills" the batch before proceeding with a selected subset of candidates. We apply our method to variable-width beam search on a state-of-the-art machine translation model. Our method decreases runtime by up to 71% compared to a fixed-width beam search baseline and 17% compared to a variable-width baseline, while matching baselines' BLEU. Finally, experiments show that our method can speed up decoding in other domains, such as semantic and syntactic parsing.
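The refill idea in the abstract can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not the authors' implementation: each input is reduced to a hypothetical number of decoding steps it needs, the batch is refilled before every step rather than periodically, there are no beams or pruning heuristics, and the count of batched steps stands in for runtime. The function names `static_batch_steps` and `streaming_batch_steps` are made up for illustration.

```python
from collections import deque


def static_batch_steps(lengths, batch_size):
    """Batched steps when each batch runs until its longest item finishes."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        # Finished items leave their slots idle until the whole batch is done.
        total += max(lengths[start:start + batch_size])
    return total


def streaming_batch_steps(lengths, batch_size):
    """Batched steps when freed slots are refilled from the pending queue.

    Assumes every entry in `lengths` is at least 1.
    """
    pending = deque(lengths)  # inputs not yet started
    active = []               # remaining steps for each in-flight input
    total = 0
    while pending or active:
        # Refill free slots before taking the next step.
        while pending and len(active) < batch_size:
            active.append(pending.popleft())
        # One batched step advances every in-flight input by one token.
        total += 1
        # Drop inputs that have just finished, freeing their slots.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return total


# Hypothetical per-input decoding lengths, chosen only for illustration.
lengths = [2, 9, 3, 8, 1, 7, 2, 6]
print("static:", static_batch_steps(lengths, batch_size=4))
print("streaming:", streaming_batch_steps(lengths, batch_size=4))
```

In this toy setting the streaming variant needs fewer batched steps because slots freed by short inputs are reused immediately instead of waiting for the longest input in the batch. The actual savings reported in the paper depend on the model, the pruning heuristics, and the refill schedule, none of which are modeled here.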


Code & Models

Repositories

yangkevin2/emnlp2020-stream-beam-mt (PyTorch, official)
