Pre-trained Summarization Distillation
Sam Shleifer, Alexander M. Rush

TL;DR
This paper compares three distillation methods for large pre-trained summarization models: knowledge distillation, pseudo-labeling, and 'shrink and fine-tune' (SFT). On Pegasus and BART, SFT outperforms knowledge distillation on CNN/DailyMail, while pseudo-labeling performs better on XSUM.
Contribution
It provides a side-by-side comparison of knowledge distillation, pseudo-labeling, and 'shrink and fine-tune' for summarization models, showing that the best-performing method depends on the dataset.
Findings
Shrink and fine-tune (SFT) outperforms knowledge distillation on CNN/DailyMail.
Pseudo-labeling performs better on XSUM.
Code and models are publicly available.
Abstract
Recent state-of-the-art approaches to summarization utilize large pre-trained Transformer models. Distilling these models to smaller student models has become critically important for practical use; however, there are many different distillation methods proposed by the NLP literature. Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation. Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model. A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning. We compare these three approaches for distillation of Pegasus and BART, the current and former state of the art, pre-trained summarization models, and find…
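The 'shrink' step of SFT described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says that parameters are copied to a smaller student, so the evenly spaced layer selection, the function names `pick_layer_indices` and `shrink`, and the toy dictionary "layers" are all assumptions made for illustration. The fine-tuning step that follows the copy is not shown.

```python
import copy


def pick_layer_indices(n_teacher: int, n_student: int) -> list[int]:
    """Pick n_student teacher layer indices, spread evenly from first to last.

    Assumption: even spacing is one plausible selection rule; the source
    only states that parameters are copied to a smaller student.
    """
    if not 1 <= n_student <= n_teacher:
        raise ValueError("n_student must be between 1 and n_teacher")
    if n_student == 1:
        return [0]
    # Integer arithmetic keeps the result deterministic and free of duplicates.
    return [i * (n_teacher - 1) // (n_student - 1) for i in range(n_student)]


def shrink(teacher_layers: list[dict], n_student: int) -> list[dict]:
    """Build a student by deep-copying the selected teacher layers."""
    indices = pick_layer_indices(len(teacher_layers), n_student)
    return [copy.deepcopy(teacher_layers[i]) for i in indices]


# Toy stand-in for a 12-layer teacher: each "layer" is a dict of parameters.
teacher = [{"id": i, "weight": [float(i), float(i)]} for i in range(12)]

# Hypothetical 3-layer student initialized from the teacher's parameters.
student = shrink(teacher, 3)
print([layer["id"] for layer in student])
```

Because the layers are deep-copied, later fine-tuning of the student would leave the teacher's parameters untouched. In a real framework the same idea would apply to actual Transformer layer modules rather than dictionaries.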
Code & Models
- 🤗 lxyuan/distilbart-finetuned-summarization · model · 8 dl · ♡ 6
- 🤗 tarekziade/t5-small-headline-generator-sft-3-3 · model · 3 dl
- 🤗 Dragon116rus/whisper-small-distill-ru · model · 4 dl
- 🤗 Dragon116rus/training · model · 2 dl
- 🤗 supawichwac/training · model · 3 dl · ♡ 1
- 🤗 GalaktischeGurke/training · model
- 🤗 nullonesix/training · model · 1 dl
- 🤗 Mike136/training · model · 7 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: PEGASUS · Linear Layer · Knowledge Distillation · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Shrink and Fine-Tune · Layer Normalization · Byte Pair Encoding · Softmax · Adam
