Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction
Mohamed Rashad

TL;DR
Arabic-Nougat fine-tunes Vision Transformer models for Arabic OCR and Markdown extraction, reporting the highest Markdown Structure Accuracy and the lowest Character Error Rate among the models tested. It also releases a 1.1 billion token Arabic dataset together with open-source tools and resources.
Contribution
The paper presents specialized Arabic OCR models based on Meta's Nougat architecture, a new tokenizer, and a large Arabic token dataset, advancing Arabic document processing research.
Findings
State-of-the-art Markdown Structure Accuracy achieved
Lowest Character Error Rate among tested models
Released a 1.1 billion token Arabic dataset
Abstract
We present Arabic-Nougat, a suite of OCR models for converting Arabic book pages into structured Markdown text. Based on Meta's Nougat architecture, Arabic-Nougat includes three specialized models: arabic-small-nougat, arabic-base-nougat, and arabic-large-nougat. These models are fine-tuned on a synthetic dataset, arabic-img2md, comprising 13.7k pairs of Arabic book pages and their Markdown representations. Key contributions include the Aranizer-PBE-86k tokenizer, designed for efficient tokenization, and the use of torch.bfloat16 precision with Flash Attention 2 for optimized training and inference. Our models achieve state-of-the-art performance, with arabic-large-nougat delivering the highest Markdown Structure Accuracy and the lowest Character Error Rate. Additionally, we release a large-scale dataset containing 1.1 billion Arabic tokens extracted from over 8,500 books using our…
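The abstract reports Character Error Rate (CER) without spelling out how it is computed. A minimal sketch of the conventional definition follows: Levenshtein edit distance between the reference and the OCR output, divided by the reference length. The function names `levenshtein` and `character_error_rate` are illustrative and not taken from the paper, and the exact text normalization the authors apply before scoring is not stated here, so this is an assumption rather than their evaluation code.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn hypothesis into reference."""
    # prev[j] holds the distance between the first i-1 reference
    # characters and the first j hypothesis characters.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # reference character missing from hypothesis
                curr[j - 1] + 1,     # extra character in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters (assumed definition)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One dropped character out of four reference characters.
    print(character_error_rate("كتاب", "كتب"))
```

Python strings index by Unicode code point, so Arabic base letters and diacritics each count as one character. Whether diacritics are stripped before scoring is a choice left to the evaluator.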
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction · Mathematics, Computing, and Information Processing
Methods: Softmax · Attention Is All You Need
