Joint Low-level and High-level Textual Representation Learning with Multiple Masking Strategies

Zhengmi Tang; Yuto Mitsui; Tomo Miyazaki; Shinichiro Omachi

arXiv:2505.06855 · cs.CV · May 13, 2025

TL;DR

This paper introduces a multi-masking strategy for masked image modeling that learns both low-level and high-level textual representations, improving performance on text recognition, text segmentation, and text image super-resolution.

Contribution

It proposes a novel multi-masking strategy combining patch, blockwise, and span masking to better capture high-level contextual information in self-supervised text recognition.
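To make the three masking families concrete, here is a minimal NumPy sketch of how patch, blockwise, and span masks could be drawn over a grid of image patches. It is an illustration under stated assumptions, not the authors' implementation: the function names, the grid size, the mask ratio, the `max_block` and `max_span` limits, and the choice to treat a span as a run of full-height patch columns are all hypothetical and do not come from the paper.

```python
import numpy as np


def patch_mask(h, w, ratio, rng):
    """Mask individual patches uniformly at random (MAE-style)."""
    n = h * w
    k = int(round(n * ratio))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=k, replace=False)] = True
    return mask.reshape(h, w)


def blockwise_mask(h, w, ratio, rng, max_block=4):
    """Mask random rectangles until at least the target count is covered.

    Assumption: block sizes are drawn uniformly; the final mask may
    slightly overshoot the target ratio.
    """
    target = int(round(h * w * ratio))
    mask = np.zeros((h, w), dtype=bool)
    while mask.sum() < target:
        bh = rng.integers(1, min(max_block, h) + 1)
        bw = rng.integers(1, min(max_block, w) + 1)
        top = rng.integers(0, h - bh + 1)
        left = rng.integers(0, w - bw + 1)
        mask[top:top + bh, left:left + bw] = True
    return mask


def span_mask(h, w, ratio, rng, max_span=3):
    """Mask contiguous runs of full-height patch columns.

    Assumption: a span covers every row of the chosen columns, so it
    hides whole stretches of the text line along the reading direction.
    """
    target_cols = int(round(w * ratio))
    cols = np.zeros(w, dtype=bool)
    while cols.sum() < target_cols:
        span = rng.integers(1, min(max_span, w) + 1)
        start = rng.integers(0, w - span + 1)
        cols[start:start + span] = True
    return np.tile(cols, (h, 1))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h, w = 4, 16  # hypothetical patch grid for a wide text-line image
    for name, fn in [("patch", patch_mask),
                     ("blockwise", blockwise_mask),
                     ("span", span_mask)]:
        m = fn(h, w, 0.5, rng)
        print(name, "masked fraction:", m.mean())
        print(m.astype(int))
```

Printing the masks side by side shows the intuition: scattered single patches leave enough nearby pixels to reconstruct from local texture, while blocks and spans remove whole character regions, so reconstruction has to rely on the surrounding context.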

Findings

01 Outperforms state-of-the-art self-supervised methods in text recognition

02 Improves performance in text segmentation and super-resolution tasks

03 Effectively captures both low-level textures and high-level context

Abstract

Most existing text recognition methods are trained on large-scale synthetic datasets due to the scarcity of labeled real-world datasets. Synthetic images, however, cannot faithfully reproduce real-world scenarios, such as uneven illumination, irregular layout, occlusion, and degradation, resulting in performance disparities when handling complex real-world images. Recent self-supervised learning techniques, notably contrastive learning and masked image modeling (MIM), narrow this domain gap by exploiting unlabeled real text images. This study first analyzes the original Masked AutoEncoder (MAE) and observes that random patch masking predominantly captures low-level textural features but misses high-level contextual representations. To fully exploit the high-level contextual representations, we introduce random blockwise and span masking in the text recognition task. These strategies can…

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis

Methods: Masked Image Modeling · Contrastive Learning