Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models

Şaziye Betül Özateş; Tarık Emre Tıraş; Ece Elif Adak; Berat Doğan; Fatih Burak Karagöz; Efe Eren Genç; Esma F. Bilgin Taşdemir

arXiv:2501.04828·cs.CL·March 30, 2026·3 cites

TL;DR

This paper develops foundational NLP resources and transformer models for historical Turkish, including datasets, a corpus, and benchmarks, to advance computational analysis of this underexplored language domain.

Contribution

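It introduces the first NER dataset (HisTR) and the first Universal Dependencies treebank (OTA-BOUN) for historical Turkish, together with the Ottoman Text Corpus (OTC) of transliterated texts and transformer-based models for NER, dependency parsing, and POS tagging.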

Findings

01 Achieved 90.29% F1 in NER

02 Attained 73.79% LAS in dependency parsing

03 Reached 94.98% F1 in POS tagging
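
For readers unfamiliar with the parsing metric above, the following is a minimal sketch of how labeled attachment score (LAS) is conventionally computed: the share of tokens whose predicted head and dependency relation both match the gold annotation. It is not the paper's evaluation script, which this page does not describe. The function name `attachment_scores` and the toy sentence are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# One token's analysis: (index of its head, dependency relation label).
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over token-aligned sequences.

    UAS counts tokens with the correct head; LAS additionally requires
    the correct relation label. Hypothetical helper, not from the paper.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    labeled_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * labeled_hits / n


# Toy four-token sentence (made-up annotations, head index 0 = root).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

In the toy sentence, the third token has the right head but the wrong label and the fourth has the wrong head, so three of four heads are correct while only two of four tokens are fully correct.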

Abstract

This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that has remained underexplored in computational linguistics. We present the first named entity recognition (NER) dataset, HisTR, and the first Universal Dependencies treebank, OTA-BOUN, for a historical form of the Turkish language along with transformer-based models trained using these datasets for named entity recognition, dependency parsing, and part-of-speech tagging tasks. Furthermore, we introduce the Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts that spans a wide range of historical periods. Our experimental results demonstrate prominent improvements in the computational analysis of historical Turkish, achieving strong performance on tasks that require understanding of historical linguistic structures --…

Code & Models

Repositories

https://hf.co/bucolin

Models

fatihburakkaragoz/ottoman-ner-latin
model · 3 downloads · 7 likes
