Accelerating Transformer Inference for Translation via Parallel Decoding

Andrea Santilli; Silvio Severino; Emilian Postolache; Valentino Maiorca; Michele Mancusi; Riccardo Marin; Emanuele Rodolà

arXiv:2305.10427 · cs.CL · February 6, 2025 · 2 cites

TL;DR

This paper introduces parallel decoding algorithms for transformer-based machine translation that significantly speed up inference without retraining or modifying models, by reframing autoregressive decoding as fixed-point iterations.
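
One way to read "reframing autoregressive decoding as fixed-point iterations" is sketched below. The notation is an assumption made for illustration and is not taken from this page: $x$ is the source sentence, $y_i$ the $i$-th target token, $p_\theta$ the translation model, $m$ the target length, and $k$ the iteration index.

```latex
% Greedy autoregressive decoding: m sequential steps, one token per step.
y_i = \arg\max_{y} \; p_\theta\!\left(y \mid y_{1:i-1},\, x\right),
\qquad i = 1, \dots, m

% Jacobi-style update: all m positions are refreshed at once from the
% previous draft y^{(k)}, so a single iteration is one parallel pass.
y_i^{(k+1)} = \arg\max_{y} \; p_\theta\!\left(y \mid y^{(k)}_{1:i-1},\, x\right),
\qquad i = 1, \dots, m

% Stopping condition: the draft is a fixed point of the update.
y^{(k+1)} = y^{(k)}
```

Under these assumptions, position $i$ depends only on earlier positions, so the first $k$ tokens agree with the greedy output after $k$ iterations. A fixed point therefore coincides with the greedy translation, and any saving comes from drafts that stabilise in fewer than $m$ iterations.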

Contribution

It proposes novel parallel decoding methods based on Jacobi and Gauss-Seidel iterations, enabling faster inference while preserving translation quality.
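
The Jacobi idea can be illustrated with a minimal, hedged sketch. This is not the authors' implementation: `toy_next_token` is a hypothetical stand-in for the argmax over a real model's next-token distribution, `PAD` is an assumed placeholder token, and the list comprehension stands in for what would be a single batched forward pass on parallel hardware.

```python
from typing import Callable, List, Tuple

PAD = 0  # assumed placeholder token used to initialise the draft


def toy_next_token(prefix: List[int]) -> int:
    """Hypothetical deterministic stand-in for a model's greedy next token."""
    last = prefix[-1] if prefix else 0
    return (3 * last + 7 * len(prefix) + 1) % 50


def greedy_decode(next_token: Callable[[List[int]], int], length: int) -> List[int]:
    """Standard sequential greedy decoding: one token per step."""
    out: List[int] = []
    for _ in range(length):
        out.append(next_token(out))
    return out


def jacobi_decode(
    next_token: Callable[[List[int]], int], length: int
) -> Tuple[List[int], int]:
    """Refine a whole draft at once until it stops changing (a fixed point).

    Returns the decoded tokens and the number of iterations performed.
    """
    draft = [PAD] * length
    for iteration in range(1, length + 2):
        # Every position is recomputed from the previous draft's prefix.
        # With a real model this would be one parallel forward pass.
        updated = [next_token(draft[:i]) for i in range(length)]
        if updated == draft:
            return draft, iteration
        draft = updated
    return draft, length + 1  # not reached: a fixed point occurs by then


if __name__ == "__main__":
    tokens, iterations = jacobi_decode(toy_next_token, 12)
    print(tokens == greedy_decode(toy_next_token, 12), iterations)
```

In this sketch the fixed point always equals the greedy output, because each position depends only on the positions before it. The toy function gives no indication of real-world speedups; those depend on how quickly drafts from an actual translation model stabilise.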

Findings

01 Speedup of up to 38% over standard decoding

02 Nearly 2x speedup with parallel resource scaling

03 Introduces a decoding dependency graph visualizer (DDGviz)

Abstract

Autoregressive decoding limits the efficiency of transformers for Machine Translation (MT). The community has proposed specific network architectures and learning-based methods to solve this issue, but these are expensive and require changes to the MT model, gaining inference speed at the cost of translation quality. In this paper, we propose to address the problem from the point of view of decoding algorithms, a less explored but rather compelling direction. We propose to reframe the standard greedy autoregressive decoding of MT with a parallel formulation leveraging Jacobi and Gauss-Seidel fixed-point iteration methods for fast inference. This formulation makes it possible to speed up existing models without training or modifications while retaining translation quality. We present three parallel decoding algorithms and test them on different languages and models showing how the parallelization…

Taxonomy

Topics: Machine Learning and Data Classification · Natural Language Processing Techniques · Neural Networks and Applications
