SPION: Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling
Bokyeong Yoon, Yoonsang Han, Gordon Euhyun Moon

TL;DR
This paper introduces SPION, a layer-wise sparse training method for Transformers that combines convolution filters with flood filling to reduce the computation and memory cost of attention, reporting up to 3.08X speedup over existing sparse Transformers.
Contribution
The paper presents a sparsification scheme that integrates convolution filters and the flood filling method to capture a layer-wise sparse pattern in attention operations, rather than applying one fixed pattern across all layers or adding parameters to learn the pattern.
Findings
Achieves up to 3.08X speedup over existing sparse Transformers.
Reduces computational complexity and memory footprint during training.
Reports better evaluation quality alongside the speedup.
Abstract
Sparsifying the Transformer has garnered considerable interest, as training the Transformer is very computationally demanding. Prior efforts to sparsify the Transformer have either used a fixed pattern or data-driven approach to reduce the number of operations involving the computation of multi-head attention, which is the main bottleneck of the Transformer. However, existing methods suffer from inevitable problems, such as the potential loss of essential sequence features due to the uniform fixed pattern applied across all layers, and an increase in the model size resulting from the use of additional parameters to learn sparsity patterns in attention operations. In this paper, we propose a novel sparsification scheme for the Transformer that integrates convolution filters and the flood filling method to efficiently capture the layer-wise sparse pattern in attention operations. Our…
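The idea of combining a convolution filter with flood filling to obtain a per-layer attention mask can be illustrated with a small NumPy sketch. This is not the authors' implementation: the diagonal averaging filter, the choice of diagonal cells as flood-fill seeds, the fixed threshold, and every function name below are assumptions made only for illustration.

```python
import numpy as np
from collections import deque


def diagonal_conv(scores, k=3):
    """Average each entry with its neighbours along the main diagonal (k must be odd).

    Assumption: a k x k identity-shaped filter, chosen here for illustration.
    """
    n, m = scores.shape
    pad = k // 2
    padded = np.pad(scores, pad, mode="constant")
    out = np.zeros((n, m), dtype=float)
    for d in range(k):
        # Each shift d picks up one tap of the diagonal filter.
        out += padded[d:d + n, d:d + m]
    return out / k


def flood_fill_mask(conv, threshold, seeds):
    """Grow a boolean mask from the seeds through 8-connected cells >= threshold."""
    n, m = conv.shape
    mask = np.zeros((n, m), dtype=bool)
    queue = deque(s for s in seeds if conv[s] >= threshold)
    for s in queue:
        mask[s] = True
    while queue:
        i, j = queue.popleft()
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ni, nj = i + di, j + dj
                if (0 <= ni < n and 0 <= nj < m
                        and not mask[ni, nj] and conv[ni, nj] >= threshold):
                    mask[ni, nj] = True
                    queue.append((ni, nj))
    return mask


def build_layer_mask(scores, k=3, threshold=0.5):
    """Build a sparsity mask for one layer from that layer's dense attention scores."""
    n = scores.shape[0]
    conv = diagonal_conv(scores, k)
    seeds = [(i, i) for i in range(n)]  # assumption: start from the diagonal
    mask = flood_fill_mask(conv, threshold, seeds)
    # Keep the diagonal so every query attends to at least one key.
    return mask | np.eye(n, dtype=bool)


def sparse_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the positions kept by mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

In this sketch each layer would call `build_layer_mask` on its own attention scores, so layers with different score patterns end up with different masks. The sketch still computes the full score matrix before masking, so it shows only how a mask could be derived and applied, not the savings in computation or memory that the paper reports.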
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Machine Learning and ELM
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Adam · Residual Connection · Attention Dropout · Layer Normalization · Label Smoothing · Byte Pair Encoding · Dropout
