DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines

Gabriel Pimenta de Freitas Cardoso; Caio Lucas da Silva Chacon; Jonas Felipe da Fonseca Oliveira; and Paulo Henrique de Medeiros Araujo

arXiv:2604.14314 · cs.CV · April 17, 2026

Abstract

This manuscript introduces DharmaOCR Full and Lite, a pair of specialized small language models (SSLMs) for structured OCR that jointly optimize transcription quality, generation stability, and inference cost. It also presents DharmaOCR-Benchmark, which covers printed, handwritten, and legal/administrative documents, and proposes a unified evaluation protocol that measures fidelity and structure while tracking text degeneration as a first-class benchmark metric, alongside unit cost. Beyond reporting degeneration rates, the manuscript shows empirically that degeneration is not merely a quality failure: it materially worsens production performance by increasing response time, reducing throughput, and inflating computational cost through abnormally long generations. To the best of the authors' knowledge, as a methodological contribution, this is the first application…
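
To make the idea of tracking degeneration as a metric concrete, the following is a minimal sketch of how a degeneration rate could be computed over a batch of OCR outputs. It is not the paper's protocol: the abstract does not define how degeneration is detected, so the repeated n-gram heuristic, the length cap, the thresholds, and the function names `repeated_ngram_ratio`, `is_degenerate`, and `degeneration_rate` are all assumptions made for illustration.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that repeat an n-gram seen earlier in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first one counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def is_degenerate(text: str, max_words: int = 2000,
                  ratio_threshold: float = 0.5, n: int = 4) -> bool:
    """Assumed heuristic: flag outputs that are abnormally long or loop heavily.

    The thresholds are illustrative placeholders, not values from the paper.
    """
    too_long = len(text.split()) > max_words
    loops = repeated_ngram_ratio(text, n) > ratio_threshold
    return too_long or loops


def degeneration_rate(outputs: list[str]) -> float:
    """Share of model outputs flagged as degenerate (0.0 for an empty batch)."""
    if not outputs:
        return 0.0
    return sum(is_degenerate(o) for o in outputs) / len(outputs)


clean = "The quick brown fox jumps over the lazy dog near the river bank"
looped = "total amount due " * 50
rate = degeneration_rate([clean, looped])
```

Under these assumptions, a looping output is also a long one, which is consistent with the abstract's point that degeneration drives up response time and computational cost rather than only lowering transcription quality.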

Code & Models

Models

Dharma-AI/Dharma-OCR-LITE
model · 2.0k downloads · 12 likes

Datasets

Dharma-AI/DharmaOCR-Benchmark
dataset · 620 downloads
