ERNIE-SPARSE: Learning Hierarchical Efficient Transformer Through Regularized Self-Attention

Yang Liu; Jiaxiang Liu; Li Chen; Yuxiang Lu; Shikun Feng; Zhida Feng; Yu Sun; Hao Tian; Hua Wu; Haifeng Wang

arXiv:2203.12276·cs.CL·March 24, 2022·6 cites

TL;DR

ERNIE-SPARSE introduces a hierarchical sparse transformer with a novel regularization technique, significantly improving efficiency and performance on long sequence modeling and downstream NLP tasks.

Contribution

The paper proposes ERNIE-SPARSE, combining hierarchical sparse attention and self-attention regularization to enhance transformer efficiency and effectiveness.

Findings

01

Outperforms baseline methods on the Long Range Arena (LRA) benchmark by 2.77%.

02

Achieves higher accuracy on text classification and QA tasks.

03

Demonstrates significant improvements in long sequence modeling and downstream NLP tasks.

Abstract

The Sparse Transformer has recently attracted a lot of attention because of its ability to reduce the quadratic dependency on sequence length. We argue that two factors, information bottleneck sensitivity and inconsistency between different attention topologies, could affect the performance of the Sparse Transformer. This paper proposes a model named ERNIE-Sparse. It consists of two distinctive parts: (i) a Hierarchical Sparse Transformer (HST) that sequentially unifies local and global information, and (ii) a Self-Attention Regularization (SAR) method, a novel regularization designed to minimize the distance between transformers with different attention topologies. To evaluate the effectiveness of ERNIE-Sparse, we perform extensive evaluations. First, we perform experiments on a multi-modal long sequence modeling benchmark, Long Range Arena (LRA). Experimental results demonstrate…
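The abstract describes SAR only as a regularizer that minimizes the distance between transformers with different attention topologies. As a rough illustration of that idea, and not the paper's actual formulation, the NumPy sketch below runs the same queries, keys, and values through a sparse (local-window) topology and a dense topology and measures how far apart the outputs are. The function names, the local-window mask, and the mean-squared-error distance are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def local_window_mask(seq_len, window):
    """Boolean mask: token i may attend to token j when |i - j| <= window.

    This is one simple sparse topology, chosen here for illustration only.
    """
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window


def attention(q, k, v, mask=None):
    """Single-head scaled dot-product attention with an optional mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        # Masked-out positions get a large negative score, so their
        # softmax weight is effectively zero.
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v


def topology_distance(q, k, v, mask):
    """Hypothetical penalty: MSE between sparse and dense attention outputs.

    The distance used in the paper may differ; MSE is an assumption.
    """
    sparse_out = attention(q, k, v, mask)
    dense_out = attention(q, k, v)
    return float(np.mean((sparse_out - dense_out) ** 2))


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
mask = local_window_mask(8, window=2)
penalty = topology_distance(q, k, v, mask)
print(f"distance between sparse and dense outputs: {penalty:.4f}")
```

In a training setup, a term like `penalty` would typically be added to the task loss with a weighting coefficient, so the sparse model is nudged toward behaving consistently across topologies. When the window covers the whole sequence, the two topologies coincide and the penalty is zero.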

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and ELM · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Cosine Annealing · Dropout · Layer Normalization