Paying More Attention to Self-attention: Improving Pre-trained Language Models via Attention Guiding

Shanshan Wang; Zhumin Chen; Zhaochun Ren; Huasheng Liang; Qiang Yan; Pengjie Ren

arXiv:2204.02922·cs.CL·April 7, 2022·1 cites

Open Access

TL;DR

This paper introduces attention guiding mechanisms to enhance pre-trained language models by promoting diverse and comprehensive self-attention patterns, leading to improved performance across various NLP tasks.

Contribution

It proposes two novel attention guiding methods, MDG and PDG, to diversify and direct self-attention in PLMs, which improves their effectiveness without significant computational cost.

Findings

1. Consistent performance improvements across multiple models and datasets.

2. Enhanced attention diversity leads to better task-specific results.

3. Methods are efficient and easy to integrate into existing models.

Abstract

Pre-trained language models (PLM) have demonstrated their effectiveness for a broad range of information retrieval and natural language processing tasks. As the core part of PLM, multi-head self-attention is appealing for its ability to jointly attend to information from different positions. However, researchers have found that PLM always exhibits fixed attention patterns regardless of the input (e.g., excessively paying attention to [CLS] or [SEP]), which we argue might neglect important information in the other positions. In this work, we propose a simple yet effective attention guiding mechanism to improve the performance of PLM by encouraging attention towards the established goals. Specifically, we propose two kinds of attention guiding methods, i.e., map discrimination guiding (MDG) and attention pattern decorrelation guiding (PDG). The former definitely encourages the diversity…
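
The abstract's concern, that heads collapse onto the same fixed pattern such as always attending to [CLS], can be illustrated with a generic head-similarity penalty. The sketch below is only an assumption-laden illustration of that idea: it is not the paper's MDG or PDG formulation, which the excerpt above does not specify, and the function names (`flatten`, `cosine_similarity`, `head_similarity_penalty`) are hypothetical. It scores a set of attention maps by the mean squared cosine similarity between every pair of heads, so lower values mean more diverse heads.

```python
import math

def flatten(attention_map):
    """Flatten one head's (seq_len x seq_len) attention map into a vector."""
    return [weight for row in attention_map for weight in row]

def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def head_similarity_penalty(attention_maps):
    """Mean squared cosine similarity over all pairs of distinct heads.

    Hypothetical illustration, not the paper's loss. Expects at least
    two heads; each head is a list of rows of attention weights.
    """
    flat = [flatten(m) for m in attention_maps]
    sims = []
    for i in range(len(flat)):
        for j in range(i + 1, len(flat)):
            sims.append(cosine_similarity(flat[i], flat[j]) ** 2)
    return sum(sims) / len(sims)

# Two heads that both fixate on position 0 (a [CLS]-like token).
collapsed = [[[1.0, 0.0, 0.0]] * 3, [[1.0, 0.0, 0.0]] * 3]
# Two heads that attend to different positions.
diverse = [[[1.0, 0.0, 0.0]] * 3, [[0.0, 1.0, 0.0]] * 3]

print("collapsed heads:", head_similarity_penalty(collapsed))
print("diverse heads:  ", head_similarity_penalty(diverse))
```

In a training setup, a term of this kind would typically be added to the task loss with a small weight; whether and how the paper does so is not stated in the excerpt.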

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Attention Dropout · Multi-Head Attention · WordPiece · Dropout · Linear Warmup With Linear Decay · Weight Decay · LAMB