MDiff4STR: Mask Diffusion Model for Scene Text Recognition

Yongkun Du; Miaomiao Zhao; Songlin Fan; Zhineng Chen; Caiyan Jia; Yu-Gang Jiang

arXiv:2512.01422·cs.CV·December 2, 2025



TL;DR

This paper introduces MDiff4STR, a Mask Diffusion Model (MDM) tailored for Scene Text Recognition (STR). It improves accuracy over a vanilla MDM and surpasses state-of-the-art auto-regressive models while keeping inference efficient.

Contribution

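The paper presents MDiff4STR, a mask-diffusion approach to STR with two key strategies: one addresses the noising gap between training and inference, the other addresses overconfident predictions during inference. Together they yield higher accuracy than auto-regressive models while remaining efficient.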

Findings

01 MDiff4STR outperforms popular STR models across various benchmarks.

02 It surpasses state-of-the-art auto-regressive models in accuracy.

03 It maintains fast inference, using only three denoising steps.

Abstract

Mask Diffusion Models (MDMs) have recently emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks, owing to their flexible balance of efficiency and accuracy. In this paper, for the first time, we introduce MDMs into the Scene Text Recognition (STR) task. We show that vanilla MDM lags behind ARMs in terms of accuracy, although it improves recognition efficiency. To bridge this gap, we propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for STR. Specifically, we identify two key challenges in applying MDMs to STR: noising gap between training and inference, and overconfident predictions during inference. Both significantly hinder the performance of MDMs. To mitigate the first issue, we develop six noising strategies that better align training with inference behavior. For the second, we propose a…
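To make the idea of decoding in a few denoising steps concrete, here is a minimal sketch of generic mask-diffusion decoding: start from a fully masked sequence, predict every position in parallel, and reveal the most confident masked positions at each step. It does not reproduce the paper's procedure: the six noising strategies and the second strategy are not implemented, and `mask_diffusion_decode`, `toy_predictor`, the linear reveal schedule, and the confidence values are all hypothetical stand-ins for a real recognizer.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"

# A predictor maps the current (partly masked) sequence to one
# (token, confidence) pair per position. Hypothetical interface.
Predictor = Callable[[List[str]], List[Tuple[str, float]]]


def mask_diffusion_decode(predict: Predictor, length: int, steps: int = 3) -> List[str]:
    """Generic confidence-based iterative unmasking (illustrative only)."""
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        preds = predict(tokens)  # parallel prediction for all positions
        # Assumed linear schedule: after step k, length * k // steps tokens are revealed.
        target_revealed = (length * step) // steps
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        n_new = target_revealed - (length - len(masked))
        # Reveal the most confident masked positions; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_new]:
            tokens[i] = preds[i][0]
    return tokens


def toy_predictor(target: str) -> Predictor:
    """Stand-in for a recognizer: always predicts the target characters,
    with a made-up confidence that grows as neighbours are revealed."""
    def predict(tokens: List[str]) -> List[Tuple[str, float]]:
        out = []
        for i, ch in enumerate(target):
            context = sum(
                1 for j in (i - 1, i + 1)
                if 0 <= j < len(tokens) and tokens[j] != MASK
            )
            out.append((ch, 0.5 + 0.2 * context))
        return out
    return predict


if __name__ == "__main__":
    decoded = mask_diffusion_decode(toy_predictor("HELLO"), length=5, steps=3)
    print("".join(decoded))
```

With `steps=3`, the whole sequence is produced in three parallel prediction passes regardless of its length, whereas an auto-regressive decoder needs one pass per character. That difference is the efficiency trade-off the abstract refers to.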


Code & Models

Models

topdu/MDiff4STR (model)

Datasets

dlxjj/OpenOCR (dataset, 1.3k downloads)

Videos

MDiff4STR: Mask Diffusion Model for Scene Text Recognition (Underline)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Topic Modeling · Domain Adaptation and Few-Shot Learning