KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

Ahmed Heakl; Abdullah Sohail; Mukul Ranjan; Rania Hossam; Ghazi Shazan Ahmad; Mohamed El-Geish; Omar Maher; Zhiqiang Shen; Fahad Khan; Salman Khan

arXiv:2502.14949 · cs.CV · June 30, 2025

TL;DR

KITAB-Bench introduces a comprehensive Arabic OCR benchmark with diverse datasets, revealing current models' limitations and guiding future improvements in Arabic document understanding.

Contribution

This paper presents the first large-scale, multi-domain Arabic OCR benchmark, filling a critical gap and providing a rigorous evaluation framework for Arabic document recognition models.

Findings

01

Modern vision-language models (such as GPT-4o, Gemini, and Qwen) outperform traditional OCR approaches by an average of 60% in Character Error Rate (CER).

02

Current Arabic OCR models struggle with PDF-to-Markdown conversion: even the best model evaluated reaches only 65% accuracy.

03

Significant challenges remain in recognizing complex fonts, numerals, and table structures in Arabic texts.
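
Character Error Rate (CER), the metric behind the first finding, is commonly defined as the character-level edit distance between an OCR prediction and the reference text, divided by the reference length. The following is a minimal Python sketch under that standard definition. The function names `levenshtein` and `cer` are illustrative, and the benchmark's own normalization and preprocessing (for example, how diacritics or whitespace are handled) may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(prediction: str, reference: str) -> float:
    """Character Error Rate: edit distance normalized by reference length.

    Assumption: characters are Unicode code points, so each Arabic
    diacritic counts as its own character.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(prediction, reference) / len(reference)


if __name__ == "__main__":
    # One missing letter in a four-letter reference word.
    print(cer("كتب", "كتاب"))
```

Because the distance is normalized by the reference length, CER can exceed 1.0 when a prediction contains many spurious characters. Lower values are better.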

Abstract

With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 sub-domains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision-language models (such as GPT-4o, Gemini, and Qwen) outperform…

Code & Models

Models

FatimahEmadEldin/Qari-OCR-Fine-Tuned-Kitab-Benchmark (model, 5 downloads, 2 likes)

Taxonomy

Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Layer Normalization · Byte Pair Encoding · Dense Connections · Residual Connection · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer