Mix and Match: Context Pairing for Scalable Topic-Controlled Educational Summarisation
Nathikan Yodthapa, Thanapong Intharah, Sahan Bulathwela

TL;DR
This paper introduces a pairwise data augmentation method for training small language models on topic-controlled summarisation, and shows that performance improves as the augmentation scale grows while the amount of real training data stays fixed.
Contribution
The authors propose a novel contrastive data augmentation technique that helps small models learn topic-summary relationships, reducing the need for large datasets.
Findings
Augmentation scale positively impacts model performance.
Small models achieve competitive results with less data.
Semantic alignment improves with increased augmentation.
Abstract
Topic-controlled summarisation enables users to generate summaries focused on specific aspects of source documents. This paper investigates a data augmentation strategy for training small language models (sLMs) to perform topic-controlled summarisation. We propose a pairwise data augmentation method that combines contexts from different documents to create contrastive training examples, enabling models to learn the relationship between topics and summaries more effectively. Using the SciTLDR dataset enriched with Wikipedia-derived topics, we systematically evaluate how augmentation scale affects model performance. Results show consistent improvements in win rate and semantic alignment as the augmentation scale increases, while the amount of real training data remains fixed. Consequently, a T5-base model trained with our augmentation approach achieves competitive performance relative to…
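The abstract does not spell out how contexts are paired, so the following is only a minimal sketch of one plausible reading: two documents with different topics are concatenated into a single mixed context, and that mixed context is emitted twice, once per topic, each time with the summary of the matching document as the target. The `Example` class, the `pair_contexts` function, and the `pairs_per_example` parameter are hypothetical names introduced for illustration; they are not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    context: str  # source text the model reads
    topic: str    # topic the summary should focus on
    summary: str  # reference summary for that topic


def pair_contexts(examples, pairs_per_example, seed=0):
    """Build mixed-context examples (assumed scheme, not the authors' exact recipe).

    Each real example is paired with up to `pairs_per_example` partners that
    have a different topic. Every pair yields two training examples that share
    one mixed context but differ in topic and target summary.
    """
    rng = random.Random(seed)
    augmented = []
    for i, first in enumerate(examples):
        # Candidate partners: every other example with a different topic.
        partners = [
            e for j, e in enumerate(examples)
            if j != i and e.topic != first.topic
        ]
        k = min(pairs_per_example, len(partners))
        for second in rng.sample(partners, k):
            docs = [first, second]
            rng.shuffle(docs)  # vary which document comes first
            mixed = "\n\n".join(d.context for d in docs)
            # Same input text, different topic, different target summary.
            augmented.append(Example(mixed, first.topic, first.summary))
            augmented.append(Example(mixed, second.topic, second.summary))
    return augmented
```

Under this reading, `pairs_per_example` plays the role of the augmentation scale: raising it increases the number of synthetic examples while the set of real documents stays the same. Because both examples in a pair share identical input text, the topic is the only signal that tells the model which summary to produce.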
