A Split-then-Join Approach to Abstractive Summarization for Very Long Documents in a Low Resource Setting
Lhuqita Fazry

TL;DR
This paper proposes a split-then-join method for abstractive summarization of very long documents using a limited-capacity pretrained model, improving performance without truncation by splitting documents and fine-tuning.
Contribution
It introduces a novel split-then-join approach that enables effective summarization of very long documents with limited model capacity, addressing domain shift and overfitting issues.
Findings
Improved summarization quality on very long documents.
Effective handling of documents over 20,000 tokens.
Demonstrated benefits of splitting and fine-tuning approach.
Abstract
model achieves on abstractive text summarization for long documents. However, its capacity is still limited to a maximum of tokens, which degrades performance when summarizing very long documents. A common way to deal with this issue is to truncate the documents. In this research, we take a different approach. We use the pretrained model and fine-tune it on a dataset from another domain. First, we filter out all documents shorter than tokens to focus on very long documents. To prevent domain shift and overfitting during transfer learning on such a small dataset, we augment the dataset by splitting each document-summary training pair into parts, so that each document part fits into tokens. Source code available on…
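The filtering and splitting steps described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation. The function names, the whitespace tokenizer, and the `min_tokens` and `max_tokens` parameters are hypothetical, and their values are left to the caller because the abstract's exact limits are not given here. How summary sentences are aligned with document parts is also an assumption: the sketch simply deals sentences out in contiguous, roughly equal groups.

```python
import re


def count_tokens(text):
    # Assumption: whitespace tokens stand in for the model's real tokenizer.
    return len(text.split())


def filter_long(pairs, min_tokens):
    # Keep only (document, summary) pairs whose document has at least min_tokens tokens.
    return [(doc, summ) for doc, summ in pairs if count_tokens(doc) >= min_tokens]


def split_document(text, max_tokens):
    # Cut the document into consecutive chunks of at most max_tokens tokens.
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def split_summary(summary, n_parts):
    # Assumption: sentences are assigned to parts in order, as evenly as possible.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    base, extra = divmod(len(sentences), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        parts.append(" ".join(sentences[start:start + size]))
        start += size
    return parts


def split_pair(document, summary, max_tokens):
    # Turn one long training pair into several shorter pairs.
    doc_parts = split_document(document, max_tokens)
    if not doc_parts:
        return []
    sum_parts = split_summary(summary, len(doc_parts))
    # Drop parts that received no summary sentence.
    return [(d, s) for d, s in zip(doc_parts, sum_parts) if s]


def join_summaries(part_summaries):
    # "Join" step: concatenate the per-part summaries in document order.
    return " ".join(p.strip() for p in part_summaries if p.strip())
```

At inference time, a sketch like this would split a long document with `split_document`, summarize each part with the fine-tuned model, and pass the results to `join_summaries`.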
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Data Quality and Management
Methods: Focus
