Large Language Models for Summarizing Czech Historical Documents and Beyond
Václav Tran, Jakub Šmíd, Jiří Martínek, Ladislav Lenc, Pavel Král

TL;DR
This paper advances Czech text summarization by applying large language models to modern and historical documents, achieving state-of-the-art results and introducing a new dataset for historical Czech texts.
Contribution
It introduces a new dataset for historical Czech summarization and demonstrates state-of-the-art performance on modern Czech summarization using large language models.
Findings
State-of-the-art results on the SumeCzech dataset
Introduction of Posel od Čerchova dataset for historical texts
Baseline results for Czech historical document summarization
Abstract
Text summarization is the task of shortening a larger body of text into a concise version while retaining its essential meaning and key information. While summarization has been significantly explored in English and other high-resource languages, Czech text summarization, particularly for historical documents, remains underexplored due to linguistic complexities and a scarcity of annotated datasets. Large language models such as Mistral and mT5 have demonstrated excellent results on many natural language processing tasks and languages. Therefore, we employ these models for Czech summarization, resulting in two key contributions: (1) achieving new state-of-the-art results on the modern Czech summarization dataset SumeCzech using these advanced models, and (2) introducing a novel dataset called Posel od Čerchova for summarization of historical Czech documents with baseline results.…
