OCRTurk: A Comprehensive OCR Benchmark for Turkish
Deniz Yılmaz, Evren Ayberk Munis, Çağrı Toraman, Süha Kağan Köse, Burak Aktaş, Mehmet Can Baytekin, Bilge Kaan Görür

TL;DR
OCRTurk is a new benchmark dataset designed to evaluate OCR models on diverse Turkish documents, addressing the lack of standardized resources for low-resource languages and enabling assessment of model robustness across various document types.
Contribution
The paper introduces OCRTurk, a comprehensive Turkish document parsing benchmark with diverse document types and difficulty levels, filling a critical gap in low-resource language OCR evaluation.
Findings
PaddleOCR achieves the best overall performance.
Model accuracy varies significantly across document types.
Slideshows are the most challenging document category.
Abstract
Document parsing is now widely used in applications such as large-scale document digitization, retrieval-augmented generation, and domain-specific pipelines in healthcare and education. Benchmarking these models is crucial for assessing their reliability and practical robustness. Existing benchmarks mostly target high-resource languages and provide limited coverage for low-resource settings such as Turkish. Moreover, existing studies on Turkish document parsing lack a standardized benchmark that reflects real-world scenarios and document diversity. To address this gap, we introduce OCRTurk, a Turkish document parsing benchmark covering multiple layout elements and document categories at three difficulty levels. OCRTurk consists of 180 Turkish documents drawn from academic articles, theses, slide decks, and non-academic articles. We evaluate seven OCR models on OCRTurk using…
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Text Readability and Simplification
