Attention Temperature Matters in Abstractive Summarization Distillation

Shengqiang Zhang; Xingxing Zhang; Hangbo Bao; Furu Wei

arXiv:2106.03441·cs.CL·March 2, 2022·6 cites

TL;DR

This paper demonstrates that adjusting attention temperatures in Transformer models enhances the effectiveness of distillation for abstractive summarization, leading to smaller models with better performance and more abstractive summaries.

Contribution

It introduces a simple yet effective method of manipulating attention temperatures to improve pseudo-labeling in Transformer-based summarization distillation.

Findings

01 Distillation performance improves consistently over vanilla pseudo-labeling methods.

02 Both the pseudo labels and the student summaries are shorter and more abstractive.

03 The method is effective across three summarization datasets.
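
The page does not say how abstractiveness is measured. One common proxy is the share of summary n-grams that never appear in the source document. The sketch below assumes whitespace tokenization and bigrams; the function names and the toy sentences are illustrative and are not taken from the paper or its repository.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_ratio(source, summary, n=2):
    """Fraction of summary n-grams that do not occur in the source.

    Assumption: lowercased whitespace tokenization, which is cruder than
    the tokenizers used in real summarization evaluations.
    """
    source_ngrams = set(ngrams(source.lower().split(), n))
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    novel = sum(1 for gram in summary_ngrams if gram not in source_ngrams)
    return novel / len(summary_ngrams)


# Toy example: an extractive summary versus a reworded one.
source = "the cat sat on the mat near the door"
copied = "the cat sat on the mat"
rewritten = "a cat rested on the mat"

print(novel_ngram_ratio(source, copied))
print(novel_ngram_ratio(source, rewritten))
```

A higher ratio suggests more rewording and less copying, so comparing this value for student summaries trained with and without the temperature change would be one way to check the second finding.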

Abstract

Recent progress in abstractive text summarization largely relies on large pre-trained sequence-to-sequence Transformer models, which are computationally expensive. This paper aims to distill these large models into smaller ones with faster inference and minimal performance loss. Pseudo-labeling based methods are popular in sequence-to-sequence model distillation. In this paper, we find that simply manipulating attention temperatures in Transformers can make pseudo labels easier for student models to learn. Our experiments on three summarization datasets show that our proposed method consistently improves over vanilla pseudo-labeling based methods. We also find that both the pseudo labels and the summaries produced by our students are shorter and more abstractive. Our code is available at https://github.com/Shengqiang-Zhang/plate.
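
To make "attention temperature" concrete, here is a minimal sketch of scaled dot-product attention weights with an adjustable temperature. It assumes the temperature is changed by scaling the usual sqrt(d) denominator with a factor `temp_scale`. That name, the toy vectors, and this exact parameterization are assumptions for illustration, and the paper and its repository may define the change differently.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats.

    Scores are divided by `temperature` before normalization.
    """
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, temp_scale=1.0):
    """Attention weights of one query vector over a list of key vectors.

    temp_scale is a hypothetical knob: 1.0 gives the standard sqrt(d)
    scaling, and values above 1.0 raise the temperature, which flattens
    the attention distribution.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores, temperature=math.sqrt(temp_scale * d))


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


query = [1.0, 0.0, 2.0, 1.0]
keys = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]

standard = attention_weights(query, keys, temp_scale=1.0)
smoothed = attention_weights(query, keys, temp_scale=2.0)
print(standard, entropy(standard))
print(smoothed, entropy(smoothed))
```

With the higher temperature the largest weight shrinks and the entropy rises, so attention is spread over more positions. In a pseudo-labeling setup, a change of this kind would be applied to the teacher when it generates the pseudo labels that the student is trained on.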

Code & Models

Repositories

shengqiang-zhang/plate (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Residual Connection · Dense Connections