Automated Transcription of Non-Latin Script Periodicals: A Case Study in the Ottoman Turkish Print Archive
Suphan Kirmizialtin, David Wrisley

TL;DR
This paper explores deep learning-based methods for automatically transcribing Ottoman Turkish periodicals written in Arabic script into Latin script, addressing historical, technical, and linguistic challenges.
Contribution
It introduces a novel approach to training handwritten text recognition (HTR) models for Ottoman Turkish that transcribe Arabic-script texts into Latin script, and discusses the implications of script change and domain bias.
Findings
Successful training of HTR models for Ottoman Turkish periodicals
Demonstrated transcriptions in Latin script from Arabic script texts
Highlighted challenges of script conversion and domain bias
Abstract
Our study utilizes deep learning methods for the automated transcription of late nineteenth- and early twentieth-century periodicals written in Arabic script Ottoman Turkish (OT) using the Transkribus platform. We discuss the historical situation of OT text collections and how they were excluded for the most part from the late twentieth-century corpora digitization that took place in many Latin script languages. This exclusion has two basic reasons: the technical challenges of OCR for Arabic script languages, and the rapid abandonment of that very script in the Turkish historical context. In the specific case of OT, opening periodical collections to digital tools requires training HTR models to generate transcriptions in the Latin writing system of contemporary readers of Turkish, and not, as some may expect, in right-to-left Arabic script text. In the paper we discuss the challenges of…
Taxonomy
Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing · Digital Humanities and Scholarship
