MDiff4STR: Mask Diffusion Model for Scene Text Recognition
Yongkun Du, Miaomiao Zhao, Songlin Fan, Zhineng Chen, Caiyan Jia, Yu-Gang Jiang

TL;DR
This paper introduces MDiff4STR, a novel Mask Diffusion Model tailored for Scene Text Recognition, which improves accuracy over vanilla diffusion models and surpasses state-of-the-art auto-regressive models while maintaining high efficiency.
Contribution
The paper presents MDiff4STR, a mask-diffusion approach to STR with two key strategies that address the noising gap between training and inference and overconfident predictions during inference, achieving superior accuracy and efficiency.
Findings
MDiff4STR outperforms popular STR models across various benchmarks.
It surpasses state-of-the-art auto-regressive models in accuracy.
It maintains fast inference with only three denoising steps.
Abstract
Mask Diffusion Models (MDMs) have recently emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks, owing to their flexible balance of efficiency and accuracy. In this paper, for the first time, we introduce MDMs into the Scene Text Recognition (STR) task. We show that a vanilla MDM lags behind ARMs in terms of accuracy, although it improves recognition efficiency. To bridge this gap, we propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for STR. Specifically, we identify two key challenges in applying MDMs to STR: a noising gap between training and inference, and overconfident predictions during inference. Both significantly hinder the performance of MDMs. To mitigate the first issue, we develop six noising strategies that better align training with inference behavior. For the second, we propose a…
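To make the idea of decoding in a few denoising steps concrete, the following is a minimal sketch of generic mask-diffusion decoding, not the authors' implementation. It assumes a confidence-ordered unmasking schedule, which the abstract does not specify. The names `MASK`, `mask_diffusion_decode`, and `toy_predict` are hypothetical, and the toy predictor stands in for a real recognition model.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"  # hypothetical mask token, chosen for illustration

# A predictor maps the current token sequence to one (token, confidence)
# pair per position. In a real system this would be the recognition model.
Predictor = Callable[[List[str]], List[Tuple[str, float]]]


def mask_diffusion_decode(predict: Predictor, length: int, steps: int = 3) -> List[str]:
    """Start fully masked and reveal the most confident positions at each step."""
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = predict(tokens)
        # Assumed schedule: reveal roughly length/steps positions per step,
        # so that everything is revealed by the final step.
        target_revealed = (length * step + steps - 1) // steps  # ceiling division
        n_reveal = target_revealed - (length - len(masked))
        # Keep only the most confident predictions among still-masked positions.
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:n_reveal]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens


def toy_predict(tokens: List[str]) -> List[Tuple[str, float]]:
    """Toy stand-in for a model: fixed characters with fixed confidences."""
    word = "hello"
    conf = [0.9, 0.4, 0.8, 0.3, 0.6]
    return list(zip(word, conf))


if __name__ == "__main__":
    print("".join(mask_diffusion_decode(toy_predict, length=5, steps=3)))
```

With the toy predictor, the sketch calls the predictor three times for a five-character sequence, whereas an auto-regressive decoder would need one call per character. This is the efficiency trade-off the abstract refers to.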
Taxonomy
Topics: Handwritten Text Recognition Techniques · Topic Modeling · Domain Adaptation and Few-Shot Learning
