RoLargeSum: A Large Dialect-Aware Romanian News Dataset for Summary, Headline, and Keyword Generation
Andrei-Marius Avram, Mircea Timpuriu, Andreea Iuga, Vlad-Cristian Matei, Iulian-Marius Tăiatu, Tudor Găină, Dumitru-Clementin Cercel, Florin Pop, Mihaela-Claudia Cercel

TL;DR
This paper introduces RoLargeSum, a large-scale Romanian news dataset with summaries, headlines, and keywords, enabling better development of summarization models for Romanian and similar languages.
Contribution
The creation of RoLargeSum, a comprehensive Romanian news dataset with rich metadata, and benchmarking of various language models on this dataset.
Findings
BART variants achieved promising results on Romanian summarization.
Open-source large language models showed potential but need further refinement.
Manual evaluation revealed specific challenges in Romanian summarization quality.
Abstract
Using supervised automatic summarisation methods requires sufficient corpora that include pairs of documents and their summaries. Similarly to many tasks in natural language processing, most of the datasets available for summarization are in English, posing challenges for developing summarization models in other languages. Thus, in this work, we introduce RoLargeSum, a novel large-scale summarization dataset for the Romanian language crawled from various publicly available news websites from Romania and the Republic of Moldova that were thoroughly cleaned to ensure a high-quality standard. RoLargeSum contains more than 615K news articles, together with their summaries, as well as their headlines, keywords, dialect, and other metadata that we found on the targeted websites. We further evaluated the performance of several BART variants and open-source large language models on RoLargeSum…
Taxonomy
Topics: Advanced Text Analysis Techniques · Information Retrieval and Search Behavior
Methods: Linear Layer · Attention Is All You Need · Dense Connections · Byte Pair Encoding · Multi-Head Attention · Residual Connection · Softmax · Adam · Layer Normalization
