DODO: Discrete OCR Diffusion Models
Sean Man, Roy Ganz, Roi Ronen, Shahar Tsiper, Shai Mazor, Niv Nayman

TL;DR
DODO applies block discrete diffusion to OCR, enabling faster, parallel decoding that maintains high accuracy and addresses the inefficiency of autoregressive models on long documents.
Contribution
This paper presents DODO, the first vision-language model to apply block discrete diffusion for OCR, significantly improving inference speed while preserving accuracy.
Findings
Achieves up to 3x faster inference than autoregressive models.
Maintains near state-of-the-art OCR accuracy.
Successfully mitigates diffusion model instabilities for rigid OCR tasks.
Abstract
Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLMs) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents because it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; they introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic…
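To make the decoding bottleneck concrete, the following is a rough back-of-the-envelope sketch, not the authors' method. It only counts model forward passes: one per token for autoregressive decoding, versus a fixed number of denoising steps per block for a block-wise parallel decoder. The function names, the block size, and the number of steps per block are hypothetical values chosen for illustration; none of them come from the paper, and a forward-pass count is not a wall-clock measurement.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """One sequential forward pass per generated token."""
    return num_tokens


def block_parallel_passes(num_tokens: int, block_size: int, steps_per_block: int) -> int:
    """Forward passes when tokens are decoded block by block.

    Assumption (illustrative only): each block of `block_size` tokens is
    refined in parallel over `steps_per_block` denoising steps, and blocks
    are processed one after another.
    """
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * steps_per_block


if __name__ == "__main__":
    # Hypothetical document length and decoder settings.
    tokens = 1200
    ar = autoregressive_passes(tokens)
    blk = block_parallel_passes(tokens, block_size=32, steps_per_block=8)
    print(f"autoregressive passes: {ar}")
    print(f"block-parallel passes: {blk}")
```

Under these assumed settings the block-wise decoder needs fewer passes whenever the number of steps per block is smaller than the block size. Actual speedups depend on per-pass cost, caching, and hardware, which this sketch ignores.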
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
