A Split-then-Join Approach to Abstractive Summarization for Very Long Documents in a Low Resource Setting

Lhuqita Fazry

arXiv:2505.06862·cs.CL·May 13, 2025

TL;DR

This paper proposes a split-then-join method for abstractive summarization of very long documents using a limited-capacity pretrained model, improving performance without truncation by splitting documents and fine-tuning.

Contribution

It introduces a novel split-then-join approach that enables effective summarization of very long documents with limited model capacity, addressing domain shift and overfitting issues.

Findings

01. Improved summarization quality on very long documents.

02. Effective handling of documents over 20,000 tokens.

03. Demonstrated benefits of the splitting and fine-tuning approach.

Abstract

The BIGBIRD-PEGASUS model achieves state-of-the-art results on abstractive text summarization for long documents. However, its capacity is still limited to a maximum of 4,096 tokens, which degrades performance on very long documents. A common way to deal with this issue is to truncate the documents. In this research, we take a different approach. We fine-tune the pretrained BIGBIRD-PEGASUS model on a dataset from another domain. First, we filter out all documents shorter than 20,000 tokens to focus on very long documents. To prevent domain shift and overfitting during transfer learning on a small dataset, we augment the dataset by splitting each document-summary training pair into parts, so that each document part fits into 4,096 tokens. Source code available on…
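The filtering and splitting steps described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: whitespace tokens stand in for the model's subword tokens, and the pairing of summary parts with document parts (contiguous, near-equal slices) as well as the join by plain concatenation are assumptions, since the abstract does not specify them. The names `split_pair`, `summarize_long`, and `summarize_part` are hypothetical.

```python
from typing import Callable, List, Tuple

MAX_TOKENS = 4096        # model input limit stated in the abstract
MIN_DOC_TOKENS = 20000   # "very long" threshold stated in the abstract


def tokenize(text: str) -> List[str]:
    # Assumption: whitespace tokens approximate the model's subword tokens.
    return text.split()


def is_very_long(document: str, min_tokens: int = MIN_DOC_TOKENS) -> bool:
    """Keep only documents with at least min_tokens tokens."""
    return len(tokenize(document)) >= min_tokens


def split_tokens(tokens: List[str], n_parts: int) -> List[List[str]]:
    """Split tokens into n_parts contiguous, near-equal parts."""
    base, extra = divmod(len(tokens), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + base + (1 if i < extra else 0)
        parts.append(tokens[start:end])
        start = end
    return parts


def num_parts(n_tokens: int, max_tokens: int) -> int:
    # Ceiling division, with at least one part for empty input.
    return max(1, -(-n_tokens // max_tokens))


def split_pair(document: str, summary: str,
               max_tokens: int = MAX_TOKENS) -> List[Tuple[str, str]]:
    """Split one (document, summary) pair into several training pairs.

    Assumption: the i-th summary slice is paired with the i-th document
    slice. Short summaries may yield empty summary slices.
    """
    doc_tokens = tokenize(document)
    n = num_parts(len(doc_tokens), max_tokens)
    doc_parts = split_tokens(doc_tokens, n)
    sum_parts = split_tokens(tokenize(summary), n)
    return [(" ".join(d), " ".join(s)) for d, s in zip(doc_parts, sum_parts)]


def summarize_long(document: str,
                   summarize_part: Callable[[str], str],
                   max_tokens: int = MAX_TOKENS) -> str:
    """Split a document, summarize each part, and join the partial summaries.

    summarize_part is a hypothetical callable wrapping a fine-tuned model.
    """
    doc_tokens = tokenize(document)
    parts = split_tokens(doc_tokens, num_parts(len(doc_tokens), max_tokens))
    return " ".join(summarize_part(" ".join(p)) for p in parts)
```

With this scheme, a 20,000-token document is divided into five parts of 4,000 tokens each, so no part exceeds the 4,096-token limit and no text is truncated.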

Code & Models

Repositories

lhfazry/spin-summ
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Data Quality and Management

Methods: Focus