Accelerating Transformer Inference for Translation via Parallel Decoding
Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, Emanuele Rodolà

TL;DR
This paper introduces parallel decoding algorithms for transformer-based machine translation that significantly speed up inference without retraining or modifying models, by reframing autoregressive decoding as fixed-point iterations.
Contribution
It proposes novel parallel decoding methods based on Jacobi and Gauss-Seidel iterations, enabling faster inference while preserving translation quality.
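The fixed-point view can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_next_token` is a hypothetical deterministic stand-in for a trained model's greedy (argmax) next-token choice, the output length `m` is fixed in advance instead of being decided by an end-of-sequence token, and the per-position loop stands in for what would be a single parallel forward pass over all positions in a real transformer. Under these assumptions, a Jacobi iteration starts from an all-padding guess, recomputes every position from the previous guess, and stops when the guess no longer changes, at which point it equals the greedy output.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: maps a target prefix to the greedy next token id.
NextToken = Callable[[Sequence[int]], int]


def toy_next_token(prefix: Sequence[int]) -> int:
    """Toy stand-in for argmax over a model's next-token distribution."""
    return (sum(prefix) * 31 + len(prefix) * 7 + 3) % 11


def greedy_decode(next_token: NextToken, m: int) -> List[int]:
    """Standard autoregressive decoding: m sequential steps."""
    y: List[int] = []
    for _ in range(m):
        y.append(next_token(y))
    return y


def jacobi_decode(next_token: NextToken, m: int, pad: int = 0) -> Tuple[List[int], int]:
    """Jacobi fixed-point iteration over all m positions.

    Each iteration recomputes every position from the previous iterate.
    Position i only reads y[:i], mirroring a causal attention mask.
    Returns the decoded tokens and the number of iterations performed.
    """
    y = [pad] * m
    iterations = 0
    while True:
        # In a real model, these m updates would be one batched forward pass.
        new_y = [next_token(y[:i]) for i in range(m)]
        iterations += 1
        if new_y == y:  # fixed point reached: every token agrees with its prefix
            return y, iterations
        y = new_y


if __name__ == "__main__":
    tokens, iters = jacobi_decode(toy_next_token, m=8)
    print(tokens == greedy_decode(toy_next_token, m=8), iters)
```

In this sketch, at least one more leading token becomes correct per iteration, so the loop needs at most `m + 1` iterations (the last one only confirms the fixed point). Any speedup comes from the cases where several tokens stabilize in the same iteration.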
Findings
Speedup of up to 38% over standard decoding
Nearly 2x speedup with parallel resource scaling
Introduces a decoding dependency graph visualizer (DDGviz)
Abstract
Autoregressive decoding limits the efficiency of transformers for Machine Translation (MT). The community has proposed specific network architectures and learning-based methods to solve this issue, but these are expensive and require changes to the MT model, trading translation quality for inference speed. In this paper, we propose to address the problem from the point of view of decoding algorithms, a less explored but rather compelling direction. We propose to reframe the standard greedy autoregressive decoding of MT with a parallel formulation leveraging Jacobi and Gauss-Seidel fixed-point iteration methods for fast inference. This formulation speeds up existing models without training or modifications while retaining translation quality. We present three parallel decoding algorithms and test them on different languages and models showing how the parallelization…
Taxonomy
Topics: Machine Learning and Data Classification · Natural Language Processing Techniques · Neural Networks and Applications
Methods: Test · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
