TL;DR
The paper introduces Masked Next-Scale Prediction (MNSP), a self-supervised framework that models hierarchical structures in scene text recognition by predicting higher-resolution features from lower-resolution contexts, improving accuracy and robustness.
Contribution
MNSP is the first unified self-supervised approach to explicitly model cross-scale structural evolution in scene text recognition, combining cross-scale prediction with masked reconstruction.
Findings
Achieves 86.2% accuracy on Union14M benchmark.
Reaches 96.7% accuracy across six standard datasets.
Improves robustness under extreme scale and layout variations.
Abstract
Scene Text Recognition requires modeling visual structures that evolve from coarse layouts to fine-grained character strokes. Training such models relies on large amounts of annotated data. Recent self-supervised approaches, such as Masked Image Modeling (MIM), alleviate this dependency by leveraging large-scale unlabeled data. Yet most existing MIM methods operate at a single spatial scale and fail to capture the hierarchical nature of scene text. In this work, we introduce Masked Next-Scale Prediction (MNSP), a unified self-supervised framework designed to explicitly model cross-scale structural evolution. The framework incorporates Next-Scale Prediction (NSP), which learns hierarchical representations by predicting higher-resolution features from lower-resolution contexts. Naive scale prediction, however, tends to produce spatially diffuse attention, directing the model toward…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
