Does Pretraining for Summarization Require Knowledge Transfer?
Kundan Krishna, Jeffrey Bigham, Zachary C. Lipton

TL;DR
This paper investigates whether the benefits of pretraining for text summarization truly come from knowledge transfer from large corpora. It finds that pretraining on documents made of randomly selected character n-grams nearly matches the performance of pretraining on real corpora, which calls the necessity of large upstream corpora into question.
Contribution
The study challenges the common belief that knowledge transfer from large datasets is what makes summarization pretraining effective, showing that pretraining on synthetic documents of random character n-grams achieves nearly the same results as pretraining on real corpora.
Findings
Pretraining on random character n-grams nearly matches the performance of pretraining on real corpora
Pretraining tasks inspired by summarization data structure do not significantly improve results
Eliminating upstream corpora may alleviate some concerns over offensive language, bias, and copyright
Abstract
Pretraining techniques leveraging enormous datasets have driven recent advances in text summarization. While folk explanations suggest that knowledge transfer accounts for pretraining's benefits, little is known about why it works or what makes a pretraining task or dataset suitable. In this paper, we challenge the knowledge transfer story, showing that by pretraining on documents consisting of character n-grams selected at random, we can nearly match the performance of models pretrained on real corpora. This work holds the promise of eliminating upstream corpora, which may alleviate some concerns over offensive language, bias, and copyright issues. To see whether the small residual benefit of using real data could be accounted for by the structure of the pretraining task, we design several tasks motivated by a qualitative study of summarization corpora. However, these tasks confer no…
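To make the idea of "documents consisting of character n-grams selected at random" concrete, the following is a minimal sketch of how such a synthetic document might be generated. It is an illustration only, not the authors' procedure: the helper names (build_ngram_vocab, make_document), the lowercase alphabet, the vocabulary size, the n-gram length, and the sentence layout are all assumptions chosen for the example.

```python
import random
import string


def build_ngram_vocab(vocab_size, n, rng):
    """Build a vocabulary of distinct random character n-grams.

    Hypothetical helper: the alphabet and sizes are assumptions,
    not values taken from the paper.
    """
    vocab = set()
    while len(vocab) < vocab_size:
        ngram = "".join(rng.choice(string.ascii_lowercase) for _ in range(n))
        vocab.add(ngram)
    # Sort so the vocabulary order does not depend on set iteration order.
    return sorted(vocab)


def make_document(vocab, num_sentences, tokens_per_sentence, rng):
    """Assemble a nonsense document by sampling n-grams uniformly at random."""
    sentences = []
    for _ in range(num_sentences):
        tokens = [rng.choice(vocab) for _ in range(tokens_per_sentence)]
        sentences.append(" ".join(tokens) + ".")
    return " ".join(sentences)


# Seeded generator so repeated runs produce the same synthetic text.
rng = random.Random(0)
vocab = build_ngram_vocab(vocab_size=1000, n=3, rng=rng)
doc = make_document(vocab, num_sentences=4, tokens_per_sentence=8, rng=rng)
print(doc)
```

The output carries no linguistic knowledge at all, which is the point of the comparison: if a model pretrained on text like this performs close to one pretrained on a real corpus, knowledge transfer cannot explain most of the benefit.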
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
