SPION: Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling

Bokyeong Yoon; Yoonsang Han; Gordon Euhyun Moon

arXiv:2309.12578·cs.LG·September 25, 2023

TL;DR

This paper introduces SPION, a layer-wise sparse training method for Transformers that combines convolutional filtering and flood filling to reduce computational cost and memory usage during training, achieving up to a 3.08X speedup over existing sparse Transformers.

Contribution

The paper presents a new sparsification scheme integrating convolution filters and flood filling for layer-wise sparse attention in Transformers, improving efficiency and performance.

Findings

01

Achieves up to 3.08X speedup over existing sparse Transformers.

02

Reduces computational complexity and memory footprint during training.

03

Delivers better evaluation quality than existing sparse Transformers.

Abstract

Sparsifying the Transformer has garnered considerable interest, as training the Transformer is very computationally demanding. Prior efforts to sparsify the Transformer have either used a fixed pattern or data-driven approach to reduce the number of operations involving the computation of multi-head attention, which is the main bottleneck of the Transformer. However, existing methods suffer from inevitable problems, such as the potential loss of essential sequence features due to the uniform fixed pattern applied across all layers, and an increase in the model size resulting from the use of additional parameters to learn sparsity patterns in attention operations. In this paper, we propose a novel sparsification scheme for the Transformer that integrates convolution filters and the flood filling method to efficiently capture the layer-wise sparse pattern in attention operations. Our…
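The abstract names two ingredients, convolution filters and flood filling, but is truncated before it explains how they are combined. The sketch below is therefore only an illustration of the general idea, not the authors' algorithm: it smooths a square attention-score matrix with a diagonal-shaped filter, then flood fills outward from the main diagonal to collect connected high-score cells into a binary sparsity mask. The function names, the filter shape, the diagonal seeds, the 4-connectivity, and the `threshold` parameter are all assumptions made for this example.

```python
from collections import deque


def diagonal_filter(scores, k=3):
    """Average each cell with its neighbours along the diagonal direction.

    Equivalent to a zero-padded convolution with a k x k identity-shaped
    filter scaled by 1/k. The filter shape is an assumption of this sketch.
    """
    n = len(scores)
    r = k // 2
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for d in range(-r, r + 1):
                ii, jj = i + d, j + d
                if 0 <= ii < n and 0 <= jj < n:
                    total += scores[ii][jj]
            out[i][j] = total / k
    return out


def flood_fill_mask(filtered, seeds, threshold):
    """Breadth-first flood fill over 4-connected cells with value >= threshold."""
    n = len(filtered)
    mask = [[0] * n for _ in range(n)]
    queue = deque()
    for i, j in seeds:
        if filtered[i][j] >= threshold and not mask[i][j]:
            mask[i][j] = 1
            queue.append((i, j))
    while queue:
        i, j = queue.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if (0 <= ni < n and 0 <= nj < n
                    and not mask[ni][nj]
                    and filtered[ni][nj] >= threshold):
                mask[ni][nj] = 1
                queue.append((ni, nj))
    return mask


def sparse_pattern(scores, k=3, threshold=0.5):
    """Hypothetical helper: filter the scores, then fill from the main diagonal."""
    n = len(scores)
    filtered = diagonal_filter(scores, k)
    seeds = [(i, i) for i in range(n)]  # assumed seed choice
    return flood_fill_mask(filtered, seeds, threshold)


if __name__ == "__main__":
    n = 5
    scores = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    scores[0][1] = 3.0  # a strong score next to the diagonal
    for row in sparse_pattern(scores):
        print(row)
```

In a layer-wise setting, a mask like this would be computed separately for each layer's attention scores, and only the cells marked 1 would be evaluated in later attention computations. A high-scoring cell that is not connected to a seed through other high-scoring cells is left out of the mask, which is the property that distinguishes flood filling from plain thresholding.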

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Machine Learning and ELM

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Adam · Residual Connection · Attention Dropout · Layer Normalization · Label Smoothing · Byte Pair Encoding · Dropout