DODO: Discrete OCR Diffusion Models

Sean Man; Roy Ganz; Roi Ronen; Shahar Tsiper; Shai Mazor; Niv Nayman

arXiv:2602.16872 · cs.CV · February 20, 2026

Open Access

TL;DR

DODO applies block discrete diffusion to OCR, decoding tokens in parallel rather than one at a time. This speeds up inference while keeping accuracy high, addressing the inefficiency of autoregressive models on long documents.

Contribution

This paper presents DODO, the first vision-language model to apply block discrete diffusion to OCR, achieving up to 3x faster inference than autoregressive models while maintaining near state-of-the-art accuracy.

Findings

01. Achieves up to 3x faster inference than autoregressive models.

02. Maintains near state-of-the-art OCR accuracy.

03. Successfully mitigates diffusion model instabilities for rigid OCR tasks.

Abstract

Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLMs) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents because it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; they introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic…
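The decoding-cost argument in the abstract can be made concrete with a rough count of sequential forward passes. The sketch below is illustrative only and is not taken from the paper: the function names, the block size, and the number of denoising steps per block are all assumptions, and it counts passes rather than wall-clock time.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """Autoregressive decoding: one sequential forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, block_size: int, steps_per_block: int) -> int:
    """Block-wise diffusion decoding (assumed schedule, not the paper's).

    Blocks are decoded left to right. Each block takes a fixed number of
    denoising steps, and every step predicts all tokens of the block in parallel.
    """
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * steps_per_block


if __name__ == "__main__":
    n = 2048  # hypothetical length of a long document, in tokens
    ar = autoregressive_passes(n)
    # block_size and steps_per_block are hypothetical values chosen for illustration
    bd = block_diffusion_passes(n, block_size=32, steps_per_block=8)
    print(ar, bd, ar / bd)
```

Under these assumed settings the block scheme needs fewer sequential passes whenever the number of denoising steps per block is smaller than the block size. Each diffusion pass may cost more than an autoregressive pass, so the pass-count ratio should not be read as a measured speedup.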

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning