The Devil is in the Details -- From OCR for Old Church Slavonic to Purely Visual Stemma Reconstruction
Armin Hoenen

TL;DR
This paper evaluates OCR systems for Old Church Slavonic manuscripts, explores post-processing with LLMs, and introduces a new image-based stemmatology method for manuscript analysis.
Contribution
It compares various OCR systems including LLMs for handwritten manuscripts and proposes a novel image processing pipeline for stemmatology.
Findings
OCR CER can reach 2-3% for basic letters
Elaborated diacritics remain challenging for OCR
The new image-based stemmatic method is demonstrated on two corpora
Abstract
The age of artificial intelligence has brought many new possibilities and pitfalls in many fields and tasks. The devil is in the details, and those come to the fore when building new pipelines and executing small practical experiments. OCR and stemmatology are no exception. The current investigation starts comparing a range of OCR-systems, from classical over machine learning to LLMs, for roughly 6,000 characters of late handwritten church slavonic manuscripts from the 18th century. Focussing on basic letter correctness, more than 10 CS OCR-systems among which 2 LLMs (GPT5 and Gemini3-flash) are being compared. Then, post-processing via LLMs is assessed and finally, different agentic OCR architectures (specialized post-processing agents, an agentic pipeline and RAG) are tested. With new technology elaborated, experiments suggest, church slavonic CER for basic letters may reach as low as…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
