Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages

Noam Dahan; Omer Kidron; Gabriel Stanovsky

arXiv:2511.14598 · cs.CL · November 19, 2025

TL;DR

This paper introduces a novel method to extract summarization data from digitized newspapers' front-page teasers across multiple languages, enabling the creation of new datasets for low-resource languages like Hebrew.

Contribution

The work presents a scalable automatic process to collect naturally occurring summaries from digitized newspapers, supporting multi-document summarization in under-represented languages.

Findings

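01. Front-page teasers, in which editors summarize full-length articles, are common across seven diverse languages.

02. Created HEBTEASESUM, the first dedicated Hebrew multi-document summarization dataset.

03. Developed an automatic collection process suited to varying linguistic resource levels.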

Abstract

High-quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, where editors summarize full-length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process suited to varying linguistic resource levels. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.
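To make the shape of such data concrete, the following is a minimal sketch of how a teaser could be paired with one or more articles from the same issue. It is not the authors' pipeline, which the abstract does not specify: the `Article` and `TeaserExample` types, the token-overlap score, and the 0.5 threshold are all hypothetical choices made for illustration, and whitespace tokenization stands in for whatever language-specific processing a real system would need.

```python
from dataclasses import dataclass


@dataclass
class Article:
    article_id: str
    text: str


@dataclass
class TeaserExample:
    teaser: str              # front-page teaser, treated as the summary
    articles: list[Article]  # one or more full-length source articles


def tokens(text: str) -> set[str]:
    # Hypothetical stand-in: whitespace split, punctuation stripped, lowercased.
    return {t.strip(".,;:!?\"'()").lower() for t in text.split()} - {""}


def overlap(teaser: str, article: str) -> float:
    # Fraction of teaser tokens that also occur in the article.
    teaser_tokens = tokens(teaser)
    if not teaser_tokens:
        return 0.0
    return len(teaser_tokens & tokens(article)) / len(teaser_tokens)


def link_teaser(teaser: str, issue_articles: list[Article],
                threshold: float = 0.5) -> TeaserExample:
    # Keep every article in the issue whose overlap meets the (assumed) threshold;
    # more than one match yields a multi-document example.
    matched = [a for a in issue_articles if overlap(teaser, a.text) >= threshold]
    return TeaserExample(teaser=teaser, articles=matched)
```

A teaser that links to several articles yields a multi-document example, while a teaser with a single match reduces to the single-document case.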

Taxonomy

Topics: Topic Modeling · Text and Document Classification Technologies · Advanced Text Analysis Techniques