Approaches to Analysing Historical Newspapers Using LLMs

Filip Dobranić; Tina Munda; Oliver Pejić; Vojko Gorjanc; Uroš Šmajdek; David Bordon; Jakob Lenardič; Tjaša Konovšek; Kristina Pahor de Maiti Tekavčič; Ciril Bohak; Darja Fišer

arXiv:2603.25051·cs.CL·March 30, 2026

TL;DR

This paper combines topic modeling, LLM-based sentiment analysis, and discourse analysis to study Slovene historical newspapers, revealing ideological differences and identity portrayals at the turn of the twentieth century.

Contribution

The paper introduces a mixed-methods approach that integrates computational techniques with critical discourse analysis to analyse noisy historical newspaper data.

Findings

01. Identified thematic patterns that reflect the ideological orientations of the two newspapers.

02. Selected GaMS3-12B-Instruct, from the four instruction-following LLMs evaluated, as the most suitable model for sentiment analysis of OCR-degraded text.

03. Revealed variation in the portrayal of identities through entity-graph analysis.

Abstract

This study presents a computational analysis of the Slovene historical newspapers Slovenec and Slovenski narod from the sPeriodika corpus, combining topic modelling, large language model (LLM)-based aspect-level sentiment analysis, entity-graph visualisation, and qualitative discourse analysis to examine how collective identities, political orientations, and national belonging were represented in public discourse at the turn of the twentieth century. Using BERTopic, we identify major thematic patterns and show both shared concerns and clear ideological differences between the two newspapers, reflecting their conservative-Catholic and liberal-progressive orientations. We further evaluate four instruction-following LLMs for targeted sentiment classification in OCR-degraded historical Slovene and select the Slovene-adapted GaMS3-12B-Instruct model as the most suitable for…
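As a rough illustration of how aspect-level sentiment labels could feed an entity graph, the sketch below aggregates (newspaper, entity, sentiment) records into weighted edges. It is not the authors' implementation: the records, the entity names `entity_a` and `entity_b`, the three-label scheme, the numeric scores, and the function `build_entity_graph` are all assumptions made up for this example, and the abstract does not specify how the paper's graphs are built.

```python
from collections import defaultdict

# Hypothetical records: (newspaper, entity, sentiment label).
# The values are invented for illustration and are not data from the paper.
records = [
    ("Slovenec", "entity_a", "positive"),
    ("Slovenec", "entity_a", "positive"),
    ("Slovenec", "entity_b", "negative"),
    ("Slovenski narod", "entity_a", "negative"),
    ("Slovenski narod", "entity_b", "positive"),
    ("Slovenski narod", "entity_b", "neutral"),
]

# Assumed mapping from labels to numeric scores.
SCORE = {"positive": 1, "neutral": 0, "negative": -1}


def build_entity_graph(rows):
    """Return {(newspaper, entity): {"count": n, "mean": average score}}."""
    totals = defaultdict(lambda: [0, 0])  # edge -> [mention count, score sum]
    for newspaper, entity, label in rows:
        edge = totals[(newspaper, entity)]
        edge[0] += 1
        edge[1] += SCORE[label]
    return {
        key: {"count": count, "mean": score_sum / count}
        for key, (count, score_sum) in totals.items()
    }


graph = build_entity_graph(records)
for (newspaper, entity), stats in sorted(graph.items()):
    print(f"{newspaper} -> {entity}: n={stats['count']}, mean={stats['mean']:+.2f}")
```

Each edge links a newspaper to an entity and carries a mention count and a mean sentiment score, so the same entity can be compared across the two newspapers. A graph library could then render these edges, with edge colour or width driven by the aggregated values.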
