Joint Low-level and High-level Textual Representation Learning with Multiple Masking Strategies
Zhengmi Tang, Yuto Mitsui, Tomo Miyazaki, Shinichiro Omachi

TL;DR
This paper introduces a multi-masking strategy for masked image modeling that learns both low-level and high-level textual representations, improving performance on text recognition, text segmentation, and text image super-resolution.
Contribution
It proposes a novel multi-masking strategy combining patch, blockwise, and span masking to better capture high-level contextual information in self-supervised text recognition.
Findings
Outperforms state-of-the-art self-supervised methods in text recognition
Improves performance in text segmentation and super-resolution tasks
Effectively captures both low-level textures and high-level context
Abstract
Most existing text recognition methods are trained on large-scale synthetic datasets due to the scarcity of labeled real-world datasets. Synthetic images, however, cannot faithfully reproduce real-world scenarios, such as uneven illumination, irregular layout, occlusion, and degradation, resulting in performance disparities when handling complex real-world images. Recent self-supervised learning techniques, notably contrastive learning and masked image modeling (MIM), narrow this domain gap by exploiting unlabeled real text images. This study first analyzes the original Masked AutoEncoder (MAE) and observes that random patch masking predominantly captures low-level textural features but misses high-level contextual representations. To fully exploit the high-level contextual representations, we introduce random blockwise and span masking in the text recognition task. These strategies can…
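The three masking strategies named in the abstract can be illustrated on a grid of image patches. The following is a minimal sketch, not the authors' implementation: the 8 x 32 patch grid, the 0.75 mask ratio, the `max_block` and `max_span` limits, and all function names are assumptions chosen for illustration. In particular, span masking is read here as masking contiguous runs of full-height patch columns, which is one plausible interpretation for horizontally laid-out text.

```python
import random


def random_patch_mask(rows, cols, ratio, rng):
    """Mask individual patches chosen uniformly at random (MAE-style)."""
    target = round(ratio * rows * cols)
    chosen = set(rng.sample(range(rows * cols), target))
    return [[(r * cols + c) in chosen for c in range(cols)] for r in range(rows)]


def blockwise_mask(rows, cols, ratio, rng, max_block=4):
    """Mask random rectangular blocks until the target patch count is reached.

    Simplification (assumption): the last block is truncated so that the
    mask contains exactly the target number of patches.
    """
    target = round(ratio * rows * cols)
    mask = [[False] * cols for _ in range(rows)]
    count = 0
    while count < target:
        h = rng.randint(1, min(max_block, rows))
        w = rng.randint(1, min(max_block, cols))
        top = rng.randint(0, rows - h)
        left = rng.randint(0, cols - w)
        for r in range(top, top + h):
            for c in range(left, left + w):
                if not mask[r][c] and count < target:
                    mask[r][c] = True
                    count += 1
    return mask


def span_mask(rows, cols, ratio, rng, max_span=4):
    """Mask contiguous runs of full-height patch columns.

    Assumption: a "span" is a horizontal run of columns, so each span
    hides roughly one or more whole characters.
    """
    target_cols = round(ratio * cols)
    masked_cols = set()
    while len(masked_cols) < target_cols:
        w = rng.randint(1, min(max_span, cols))
        left = rng.randint(0, cols - w)
        for c in range(left, left + w):
            if len(masked_cols) < target_cols:
                masked_cols.add(c)
    return [[c in masked_cols for c in range(cols)] for _ in range(rows)]


def render(mask):
    """Draw a mask as text: '#' is a masked patch, '.' is a visible one."""
    return "\n".join("".join("#" if m else "." for m in row) for row in mask)


if __name__ == "__main__":
    rng = random.Random(0)
    for fn in (random_patch_mask, blockwise_mask, span_mask):
        print(fn.__name__)
        print(render(fn(8, 32, 0.75, rng)))
```

Printing the three masks side by side shows the intuition behind the abstract's observation: random patch masking leaves visible fragments of nearly every character, so reconstruction can lean on local texture, whereas blockwise and span masks hide larger contiguous regions that can only be recovered from surrounding context.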
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
Methods: Masked Image Modeling · Contrastive Learning
