DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Seongmin Hong; Seungjae Moon; Junsoo Kim; Sungjae Lee; Minsub Kim; Dongsoo Lee; Joo-Young Kim

arXiv:2209.10797 · eess.SY · September 26, 2022 · 5 cites

TL;DR

This paper introduces DFX, a multi-FPGA hardware platform that accelerates GPT-2 inference with low latency and high throughput, outperforming GPUs in speed, energy efficiency, and cost for natural language generation tasks.

Contribution

The paper presents a novel multi-FPGA architecture optimized for GPT-2 inference, achieving significant speedup and efficiency improvements over GPU-based solutions.

Findings

01

DFX achieves 5.58x speedup over NVIDIA V100 GPUs.

02

DFX provides 3.99x better energy efficiency than GPUs.

03

DFX is 8.21x more cost-effective than GPU appliances.

Abstract

Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which needs the processing of a large input context in the summarization stage, followed by the generation stage that produces a single word at a time. The conventional platforms such as GPU are specialized for the parallel processing of large inputs in the summarization stage, but their performance significantly degrades in the generation stage due to its sequential characteristic. Therefore, an efficient hardware platform is required to address the high latency caused by the sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2…
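
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the DFX implementation: `toy_next_token` is a hypothetical stand-in for a GPT-2 forward pass, and the sketch assumes the summarization pass yields the first output token. It shows only the data dependency that makes the generation stage sequential, since each new token is computed from all tokens before it.

```python
from typing import Callable, List, Tuple


def toy_next_token(tokens: List[int]) -> int:
    """Hypothetical stand-in for a model forward pass.

    Returns a token id that depends on every token seen so far.
    """
    return (sum(tokens) + len(tokens)) % 50


def generate(
    context: List[int],
    next_token: Callable[[List[int]], int],
    num_new_tokens: int,
) -> Tuple[List[int], int]:
    """Return the generated tokens and the number of sequential model passes."""
    if num_new_tokens < 1:
        raise ValueError("num_new_tokens must be at least 1")

    tokens = list(context)
    sequential_passes = 0

    # Summarization stage: the whole input context is available up front,
    # so its tokens can be processed together in one pass.
    # Assumption: this pass also produces the first output token.
    tokens.append(next_token(tokens))
    sequential_passes += 1

    # Generation stage: each pass consumes the token produced by the
    # previous pass, so the passes cannot run in parallel with each other.
    for _ in range(num_new_tokens - 1):
        tokens.append(next_token(tokens))
        sequential_passes += 1

    return tokens[len(context):], sequential_passes


if __name__ == "__main__":
    new_tokens, passes = generate([1, 2, 3], toy_next_token, 4)
    print(new_tokens, passes)
```

In this sketch the number of sequential passes grows with the number of generated tokens and not with the context length, which is the property the abstract identifies as the source of high latency on throughput-oriented platforms.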

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques

Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Weight Decay · Attention Dropout · Discriminative Fine-Tuning · Absolute Position Encodings · Position-Wise Feed-Forward Layer