Masked Next-Scale Prediction for Self-supervised Scene Text Recognition

Zhuohao Chen; Zeng Li; Yifei Zhang; Chang Liu; Yu Zhou

arXiv:2605.14885·cs.CV·May 15, 2026

TL;DR

The paper introduces Masked Next-Scale Prediction (MNSP), a self-supervised framework that models hierarchical structures in scene text recognition by predicting higher-resolution features from lower-resolution contexts, improving accuracy and robustness.

Contribution

MNSP is the first unified self-supervised approach to explicitly model cross-scale structural evolution in scene text recognition, combining cross-scale prediction with masked reconstruction.

Findings

1. Achieves 86.2% accuracy on the Union14M benchmark.

2. Reaches 96.7% accuracy across six standard datasets.

3. Improves robustness under extreme scale and layout variations.

Abstract

Scene Text Recognition requires modeling visual structures that evolve from coarse layouts to fine-grained character strokes. Training such models relies on large amounts of annotated data. Recent self-supervised approaches, such as Masked Image Modeling (MIM), alleviate this dependency by leveraging large-scale unlabeled data. Yet most existing MIM methods operate at a single spatial scale and fail to capture the hierarchical nature of scene text. In this work, we introduce Masked Next-Scale Prediction (MNSP), a unified self-supervised framework designed to explicitly model cross-scale structural evolution. The framework incorporates Next-Scale Prediction (NSP), which learns hierarchical representations by predicting higher-resolution features from lower-resolution contexts. Naive scale prediction, however, tends to produce spatially diffuse attention, directing the model toward…
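
To make the idea of predicting a finer scale from a masked coarser one concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the 1-D feature sequence, the average-pooling pyramid, the zero-fill masking, the nearest-neighbour upsampling used as a stand-in for a learned predictor, and every function name below are assumptions made for illustration. The abstract above is truncated, so the sketch does not cover how MNSP addresses spatially diffuse attention.

```python
import random


def avg_pool(seq, factor):
    """Downsample by averaging non-overlapping windows of length `factor`."""
    assert len(seq) % factor == 0, "length must be divisible by factor"
    return [sum(seq[i:i + factor]) / factor for i in range(0, len(seq), factor)]


def build_pyramid(seq, num_scales):
    """Return the sequence at each scale, ordered coarse to fine (last is full resolution)."""
    return [avg_pool(seq, 2 ** s) for s in range(num_scales - 1, -1, -1)]


def random_mask(length, ratio, rng):
    """Boolean list where True marks a hidden position."""
    hidden = set(rng.sample(range(length), int(round(ratio * length))))
    return [i in hidden for i in range(length)]


def upsample_nearest(seq, factor=2):
    """Repeat each value `factor` times (hypothetical stand-in for a learned predictor)."""
    return [v for v in seq for _ in range(factor)]


def masked_next_scale_losses(seq, num_scales=3, mask_ratio=0.5, seed=0):
    """For each adjacent pair of scales, mask the coarse input, predict the
    next finer scale, and return the mean squared error per scale pair."""
    rng = random.Random(seed)
    pyramid = build_pyramid(seq, num_scales)
    losses = []
    for coarse, fine in zip(pyramid[:-1], pyramid[1:]):
        mask = random_mask(len(coarse), mask_ratio, rng)
        # Assumption: masked positions are replaced by zeros.
        visible = [0.0 if m else v for v, m in zip(coarse, mask)]
        pred = upsample_nearest(visible)
        losses.append(sum((p - t) ** 2 for p, t in zip(pred, fine)) / len(fine))
    return losses


if __name__ == "__main__":
    features = [float(i) for i in range(8)]
    print(masked_next_scale_losses(features))
```

In a real model the upsampling step would be replaced by a trained network and the features would be 2-D encoder outputs; the sketch only shows how a coarse-to-fine target pyramid and a per-scale masked loss fit together.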

Code & Models

Repositories

CzhczhcHczh/MNSP (GitHub)
