SARD: A Large-Scale Synthetic Arabic OCR Dataset for Book-Style Text Recognition

Omer Nacar; Yasser Al-Habashi; Serry Sibaee; Adel Ammar; Wadii Boulila

arXiv:2505.24600·cs.CV·June 2, 2025

TL;DR

SARD is a large-scale, synthetically generated Arabic OCR dataset designed to facilitate the training and evaluation of OCR models on diverse, book-style Arabic texts, overcoming limitations of existing datasets.

Contribution

We introduce SARD, a synthetic Arabic OCR dataset of 843,622 document images containing 690 million words, rendered in ten distinct Arabic fonts and book-style layouts for training and evaluating OCR models.

Findings

01. Benchmark results demonstrate the dataset's utility for OCR model development.

02. Synthetic data enables scalable and controlled training environments.

03. The dataset highlights challenges in Arabic OCR, guiding future research.

Abstract

Arabic Optical Character Recognition (OCR) is essential for converting vast amounts of Arabic print media into digital formats. However, training modern OCR models, especially powerful vision-language models, is hampered by the lack of large, diverse, and well-structured datasets that mimic real-world book layouts. Existing Arabic OCR datasets often focus on isolated words or lines, or fall short of the scale, typographic variety, or structural complexity found in books. To address this gap, we introduce SARD (Large-Scale Synthetic Arabic OCR Dataset). SARD is a massive, synthetically generated dataset specifically designed to simulate book-style documents. It comprises 843,622 document images containing 690 million words, rendered across ten distinct Arabic fonts to ensure broad typographic coverage. Unlike datasets derived from scanned documents, SARD is free from real-world…

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
