Large Language Models for Summarizing Czech Historical Documents and Beyond

Václav Tran; Jakub Šmíd; Jiří Martínek; Ladislav Lenc; Pavel Král

arXiv:2508.10368 · cs.CL · August 15, 2025

TL;DR

This paper applies large language models (Mistral and mT5) to Czech text summarization, setting new state-of-the-art results on the modern Czech dataset SumeCzech and introducing Posel od Čerchova, a new dataset of historical Czech documents with baseline results.

Contribution

The paper introduces Posel od Čerchova, a new dataset for summarizing historical Czech documents, and demonstrates state-of-the-art performance on the modern Czech summarization dataset SumeCzech using large language models.

Findings

01

State-of-the-art results on the SumeCzech dataset

02

Introduction of the Posel od Čerchova dataset for historical texts

03

Baseline results for Czech historical document summarization

Abstract

Text summarization is the task of shortening a larger body of text into a concise version while retaining its essential meaning and key information. While summarization has been significantly explored in English and other high-resource languages, Czech text summarization, particularly for historical documents, remains underexplored due to linguistic complexities and a scarcity of annotated datasets. Large language models such as Mistral and mT5 have demonstrated excellent results on many natural language processing tasks and languages. Therefore, we employ these models for Czech summarization, resulting in two key contributions: (1) achieving new state-of-the-art results on the modern Czech summarization dataset SumeCzech using these advanced models, and (2) introducing a novel dataset called Posel od Čerchova for summarization of historical Czech documents with baseline results.…
