BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization
Eva Sharma, Chen Li, and Lu Wang

TL;DR
BIGPATENT is a large-scale dataset of 1.3 million U.S. patent documents paired with human-written abstractive summaries. Compared with news-based datasets, its summaries have a richer discourse structure and less extractive overlap with the input, and salient content is spread evenly across the input rather than concentrated at the start.
Contribution
The paper introduces BIGPATENT, a dataset of U.S. patent documents with human-written summaries, built to support the learning and evaluation of systems that model a document's global content structure and produce abstractive summaries with a high compression ratio.
Findings
Baseline models perform poorly on the dataset.
Summarization models struggle with the complex discourse structure.
The dataset reveals new challenges for abstractive summarization.
Abstract
Most existing text summarization datasets are compiled from the news domain, where summaries have a flattened discourse structure. In such datasets, summary-worthy content often appears at the beginning of input articles. Moreover, large segments from input articles are present verbatim in their respective summaries. These issues impede the learning and evaluation of systems that can understand an article's global content structure as well as produce abstractive summaries with a high compression ratio. In this work, we present a novel dataset, BIGPATENT, consisting of 1.3 million records of U.S. patent documents along with human-written abstractive summaries. Compared to existing summarization datasets, BIGPATENT has the following properties: i) summaries contain a richer discourse structure with more recurring entities, ii) salient content is evenly distributed in the input, and iii)…
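The abstract's concern about summaries copying large segments verbatim from the input can be made concrete with a simple overlap measure. The following is a minimal sketch, not the paper's own measurement: it assumes lowercase whitespace tokenization, and the function names `ngrams` and `novel_ngram_fraction` are hypothetical helpers introduced here for illustration. It reports the fraction of summary n-grams that never appear in the input document, so a higher value suggests a more abstractive summary.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_fraction(document, summary, n=2):
    """Fraction of summary n-grams that do not occur in the document.

    Assumption: naive lowercase whitespace tokenization; a real analysis
    would likely use a proper tokenizer.
    """
    doc_ngrams = set(ngrams(document.lower().split(), n))
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        # Summary is shorter than n tokens, so there is nothing to compare.
        return 0.0
    novel = sum(1 for gram in summary_ngrams if gram not in doc_ngrams)
    return novel / len(summary_ngrams)


if __name__ == "__main__":
    doc = "the cat sat on the mat"
    print(novel_ngram_fraction(doc, "the cat sat"))  # fully copied bigrams
    print(novel_ngram_fraction(doc, "the cat ran"))  # one of two bigrams is new
```

Comparing this value across datasets at several values of `n` is one way to see how extractive a dataset's reference summaries are.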
Code & Models
- 🤗 BSC-LT/salamandra-7b-instruct (model) · 81k dl · ♡ 77
- 🤗 unb-lamfo-nlp-mcti/NLP-ATS-MCTI (model)
- 🤗 BSC-LT/salamandra-7b (model) · 355 dl · ♡ 29
- 🤗 BSC-LT/salamandra-2b (model) · 1.3k dl · ♡ 25
- 🤗 BSC-LT/salamandra-2b-instruct (model) · 6.3k dl · ♡ 27
- 🤗 robbiemu/salamandra-2b-instruct (model) · 92 dl
- 🤗 RichardErkhov/BSC-LT_-_salamandra-7b-instruct-gguf (model) · 141 dl
- 🤗 RichardErkhov/BSC-LT_-_salamandra-7b-gguf (model) · 73 dl
- 🤗 robbiemu/salamandra-2b (model) · 111 dl
- 🤗 RichardErkhov/BSC-LT_-_salamandra-2b-instruct-gguf (model) · 356 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
