Large Language Models for the Summarization of Czech Documents: From History to the Present
Václav Tran, Jakub Šmíd, Ladislav Lenc, Jean-Pierre Salmon, and Pavel Král

TL;DR
This paper explores the use of large multilingual language models for Czech document summarization, introducing new datasets and achieving state-of-the-art results, especially for historical texts and low-resource language settings.
Contribution
It demonstrates the effectiveness of LLMs for Czech summarization, introduces a new historical dataset, and provides baseline results to advance research in this underexplored area.
Findings
LLMs achieve state-of-the-art results on SumeCzech.
The translation-based approach improves summarization quality.
The new dataset, Posel od Čerchova, supports historical Czech text summarization.
Abstract
Text summarization is the task of automatically condensing longer texts into shorter, coherent summaries while preserving the original meaning and key information. Although this task has been extensively studied in English and other high-resource languages, Czech summarization, particularly in the context of historical documents, remains underexplored. This is largely due to the inherent linguistic complexity of Czech and the lack of high-quality annotated datasets. In this work, we address this gap by leveraging the capabilities of Large Language Models (LLMs), specifically Mistral and mT5, which have demonstrated strong performance across a wide range of natural language processing tasks and multilingual settings. We also propose a translation-based approach that first translates Czech texts into English, summarizes them using an English-language model, and then…
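The translation-based approach described in the abstract can be sketched as a composition of two text-to-text steps. This is a minimal illustration, not the authors' implementation: the function name `summarize_via_english`, its parameters, and the toy stand-ins are all hypothetical, and because the abstract is truncated after the summarization step, anything that follows it is left as an optional caller-supplied hook rather than guessed at.

```python
from typing import Callable, Optional

# A text-to-text step: takes a string, returns a string.
TextFn = Callable[[str], str]


def summarize_via_english(
    czech_text: str,
    translate_cs_to_en: TextFn,
    summarize_en: TextFn,
    postprocess: Optional[TextFn] = None,
) -> str:
    """Hypothetical sketch of a translate-then-summarize pipeline.

    translate_cs_to_en and summarize_en are placeholders for real models
    (e.g. a machine translation system and an English summarizer).
    postprocess is an assumed hook for whatever step follows summarization;
    the source text is truncated there, so no default is provided.
    """
    english_text = translate_cs_to_en(czech_text)   # step 1: Czech -> English
    english_summary = summarize_en(english_text)    # step 2: summarize in English
    if postprocess is None:
        return english_summary
    return postprocess(english_summary)             # step 3: caller-defined


# Toy stand-ins so the sketch runs without any model; they do not translate
# or summarize in any meaningful sense.
def toy_translate(text: str) -> str:
    return "[en] " + text


def toy_summarize(text: str) -> str:
    # Keep only the first sentence as a crude "summary".
    return text.split(". ")[0].rstrip(".") + "."


if __name__ == "__main__":
    document = "První věta. Druhá věta. Třetí věta."
    print(summarize_via_english(document, toy_translate, toy_summarize))
```

Passing the models in as plain callables keeps the sketch independent of any particular library, so a real translation system and summarizer could be substituted without changing the pipeline function.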
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Biomedical Text Mining and Ontologies
