TLDR9+: A Large Scale Resource for Extreme Summarization of Social Media Posts

Sajad Sotudeh; Hanieh Deilamsalehy; Franck Dernoncourt; Nazli Goharian

arXiv:2110.01159·cs.CL·October 6, 2021

TL;DR

This paper introduces TLDR9+, a large-scale Reddit-based dataset of over 9 million training instances for extreme summarization, together with TLDRHQ, a high-quality subset intended for training summarization models.

Contribution

The creation of TLDR9+, the largest dataset for extreme social media summarization, and TLDRHQ, a refined high-quality subset for improved model training.

Findings

01. TLDR9+ contains over 9 million training instances.

02. TLDRHQ is a fine-grained, high-quality subset of TLDR9+.

03. Different state-of-the-art models are evaluated on these datasets.

Abstract

Recent summarization models have millions of parameters, and their performance depends heavily on the abundance of training data. While most existing summarization corpora contain on the order of thousands to one million instances, building large-scale summarization datasets on the order of a few million instances has yet to be explored. In practice, more data helps a model generalize the patterns it learns in training to unseen data. In this paper, we introduce TLDR9+, a large-scale summarization dataset containing over 9 million training instances extracted from the Reddit discussion forum (https://github.com/sajastu/reddit_collector). This dataset is specifically gathered to perform extreme summarization (i.e., generating a one-sentence summary with high compression and abstraction) and is more than twice as large as the previously proposed dataset. We go one step further…
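To make the extreme-summarization setup concrete, here is a minimal sketch of how a Reddit post carrying an author-written TL;DR could be turned into a (source, summary) pair, along with a word-level compression ratio. This is not the authors' collection pipeline: the marker regex and the function names `split_post` and `compression_ratio` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical marker pattern (an assumption, not taken from the paper):
# matches "TL;DR", "tl;dr", "TLDR", "tl dr", optionally followed by ':' or '-'.
TLDR_PATTERN = re.compile(r"\btl\s*;?\s*dr\b\s*[:\-]?\s*", re.IGNORECASE)


def split_post(post):
    """Return (source, summary) if the post has a TL;DR marker, else None.

    The text before the last marker is treated as the source and the
    text after it as the summary. Posts with an empty source or an
    empty summary are rejected.
    """
    matches = list(TLDR_PATTERN.finditer(post))
    if not matches:
        return None
    last = matches[-1]
    source = post[:last.start()].strip()
    summary = post[last.end():].strip()
    if not source or not summary:
        return None
    return source, summary


def compression_ratio(source, summary):
    """Whitespace-token count of the source divided by that of the summary."""
    return len(source.split()) / len(summary.split())


if __name__ == "__main__":
    post = "one two three four five six. TL;DR: short one"
    pair = split_post(post)
    if pair is not None:
        source, summary = pair
        print(summary)
        print(compression_ratio(source, summary))
```

A higher ratio means the summary compresses the post more aggressively. A real pipeline would need further filtering (for example, for noisy or uninformative summaries), which this sketch does not attempt.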

Code & Models

Repositories

sajastu/reddit_collector (official): https://github.com/sajastu/reddit_collector

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies