Paying More Attention to Self-attention: Improving Pre-trained Language Models via Attention Guiding
Shanshan Wang, Zhumin Chen, Zhaochun Ren, Huasheng Liang, Qiang Yan, Pengjie Ren

TL;DR
This paper introduces attention guiding mechanisms to enhance pre-trained language models by promoting diverse and comprehensive self-attention patterns, leading to improved performance across various NLP tasks.
Contribution
The paper proposes two attention guiding methods, map discrimination guiding (MDG) and attention pattern decorrelation guiding (PDG), to diversify and direct self-attention in PLMs, improving their effectiveness without significant computational cost.
Findings
Consistent performance improvements across multiple models and datasets.
Enhanced attention diversity leads to better task-specific results.
Methods are efficient and easy to integrate into existing models.
Abstract
Pre-trained language models (PLM) have demonstrated their effectiveness for a broad range of information retrieval and natural language processing tasks. As the core part of PLM, multi-head self-attention is appealing for its ability to jointly attend to information from different positions. However, researchers have found that PLM always exhibits fixed attention patterns regardless of the input (e.g., excessively paying attention to [CLS] or [SEP]), which we argue might neglect important information in the other positions. In this work, we propose a simple yet effective attention guiding mechanism to improve the performance of PLM by encouraging attention towards the established goals. Specifically, we propose two kinds of attention guiding methods, i.e., map discrimination guiding (MDG) and attention pattern decorrelation guiding (PDG). The former definitely encourages the diversity…
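The two symptoms the abstract describes, attention concentrating on special tokens such as [CLS] and heads producing near-identical maps, can be made concrete with a small diagnostic. The sketch below is an illustration under stated assumptions only: it is not the paper's MDG or PDG objective, and the function names `special_token_mass` and `mean_pairwise_head_similarity`, as well as the toy attention maps, are hypothetical. It uses plain Python lists so that it runs without any dependencies.

```python
import math


def special_token_mass(attn, special_positions):
    """Per-head average share of attention that lands on special positions.

    attn: list of heads; each head is a list of rows (one per query token),
    and each row is a list of attention weights over key positions (sums to 1).
    """
    masses = []
    for head in attn:
        per_query = [sum(row[j] for j in special_positions) for row in head]
        masses.append(sum(per_query) / len(per_query))
    return masses


def mean_pairwise_head_similarity(attn):
    """Mean cosine similarity between flattened maps of distinct heads.

    Generic diversity measure for illustration; NOT the paper's loss.
    Values near 1 mean the heads attend in almost the same way.
    """
    flat = [[w for row in head for w in row] for head in attn]
    norms = [math.sqrt(sum(w * w for w in v)) for v in flat]
    total, pairs = 0.0, 0
    for a in range(len(flat)):
        for b in range(a + 1, len(flat)):
            dot = sum(x * y for x, y in zip(flat[a], flat[b]))
            total += dot / (norms[a] * norms[b])
            pairs += 1
    return total / pairs if pairs else 0.0


# Toy maps for 3 tokens; position 0 plays the role of [CLS] (assumption).
cls_heavy = [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1]]
diagonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

collapsed = [cls_heavy, cls_heavy]  # two heads with the same fixed pattern
mixed = [cls_heavy, diagonal]       # two heads with different patterns

print(special_token_mass(collapsed, [0]))
print(mean_pairwise_head_similarity(collapsed))
print(mean_pairwise_head_similarity(mixed))
```

In a training setup, a quantity like the head similarity could in principle be added to the task loss as a penalty, which is the general idea behind guiding attention toward a stated goal; how the paper actually formulates its two guiding terms is not recoverable from the truncated abstract above.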
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Attention Dropout · Multi-Head Attention · WordPiece · Dropout · Linear Warmup With Linear Decay · Weight Decay · LAMB
