Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages
Noam Dahan, Omer Kidron, Gabriel Stanovsky

TL;DR
This paper introduces a novel method to extract summarization data from digitized newspapers' front-page teasers across multiple languages, enabling the creation of new datasets for low-resource languages like Hebrew.
Contribution
The work presents a scalable automatic process to collect naturally occurring summaries from digitized newspapers, supporting multi-document summarization in under-represented languages.
Findings
Front-page teasers are shown to be a common source of naturally occurring summaries across seven diverse languages.
Created HEBTEASESUM, the first Hebrew multi-document summarization dataset.
Developed an automatic collection process that scales and adapts to varying levels of linguistic resources.
Abstract
High-quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, where editors summarize full-length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process suited to varying linguistic resource levels. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.
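Illustrative sketch
The abstract does not spell out how teasers are paired with the articles they summarize. As a rough illustration only, and not the authors' pipeline, the sketch below pairs a front-page teaser with candidate articles from the same issue by lexical containment, that is, the share of teaser tokens that also appear in an article. The function names, the containment heuristic, and the 0.5 threshold are all assumptions made for this example. A teaser may match several articles, which is the situation that yields a multi-document summarization instance.

```python
import re


def tokens(text):
    """Lowercased word tokens. \\w is Unicode-aware in Python 3, so Hebrew text works too."""
    return set(re.findall(r"\w+", text.lower()))


def containment(teaser_tokens, article_tokens):
    """Fraction of teaser tokens that also occur in the article (0.0 for an empty teaser)."""
    if not teaser_tokens:
        return 0.0
    return len(teaser_tokens & article_tokens) / len(teaser_tokens)


def align_teaser(teaser, articles, threshold=0.5):
    """Hypothetical heuristic: return ids of articles the teaser plausibly summarizes, best first.

    `articles` maps an article id to its full text; `threshold` is an assumed cutoff.
    """
    t = tokens(teaser)
    scored = [(containment(t, tokens(body)), art_id) for art_id, body in articles.items()]
    return [art_id for score, art_id in sorted(scored, reverse=True) if score >= threshold]


if __name__ == "__main__":
    issue = {
        "p3": "The port strike ends today after long talks between unions and owners",
        "p5": "Rain expected across the north this weekend",
    }
    print(align_teaser("Port strike ends after talks", issue))
```

A real pipeline would likely need OCR-tolerant matching and layout cues as well; plain token overlap is used here only because it requires no language-specific resources.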
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies · Advanced Text Analysis Techniques
