Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling
Zixiao Wang, Hongtao Xie, YuXin Wang, Yadong Qu, Fengjun Guo, and Pengwei Liu

TL;DR
This paper introduces TMIM, a novel weakly supervised pretraining method for scene text removal that leverages text localization data to improve performance and reduce reliance on costly pixel-level annotations.
Contribution
The paper proposes a new Text-aware Masked Image Modeling approach that enables direct, weakly supervised training of scene text removal models using only text detection labels.
Findings
Achieves state-of-the-art PSNR of 37.35 on SCUT-EnsText
Outperforms previous pretraining methods in scene text removal
Reduces dependence on expensive pixel-level annotations
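For readers unfamiliar with the metric in the first finding, PSNR (peak signal-to-noise ratio) compares an output image against a reference and is conventionally reported in decibels, with higher values meaning a closer match. The sketch below uses the standard definition, 10 * log10(MAX^2 / MSE), on single-channel images stored as nested lists. The function name `psnr` and the 8-bit peak value of 255 are assumptions for illustration; this page does not state how the paper handles color channels or averages over the test set.

```python
import math


def psnr(reference, output, max_value=255.0):
    """Standard PSNR between two equally sized grayscale images (nested lists).

    Illustrative only: the paper's exact evaluation protocol is not
    described on this page.
    """
    flat_ref = [p for row in reference for p in row]
    flat_out = [p for row in output for p in row]
    if len(flat_ref) != len(flat_out) or not flat_ref:
        raise ValueError("images must be non-empty and the same size")

    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_out)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the score is logarithmic in the mean squared error, halving the per-pixel error raises PSNR by about 6 dB.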
Abstract
The scene text removal (STR) task suffers from insufficient training data because pixel-level labeling is expensive. In this paper, we aim to address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels (e.g., text bounding boxes). Unlike previous pretraining methods, which use indirect auxiliary tasks only to enhance implicit feature extraction, our TMIM first enables the STR task to be directly trained in a weakly supervised manner, which explores the STR knowledge explicitly and efficiently. In TMIM, first, a Background Modeling stream is built to learn background generation rules by recovering the masked non-text region. Meanwhile, it provides pseudo STR labels on the masked text region. Second, a Text Erasing stream is proposed to learn from the pseudo labels and equip…
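The abstract's distinction between the masked non-text region and the masked text region can be illustrated with a small sketch. It shows one way bounding-box labels might partition masked pixels into a part that has real ground truth (background, recoverable from the original image) and a part that does not (text, where only pseudo labels are available). Every name here (`box_mask`, `random_patch_mask`, `split_masked_pixels`), the patch-based masking, and the mask ratio are assumptions for illustration and are not taken from the paper.

```python
import random


def box_mask(height, width, boxes):
    """Grid with 1 inside any (x0, y0, x1, y1) text box (x1, y1 exclusive)."""
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1
    return mask


def random_patch_mask(height, width, patch, ratio, rng):
    """Mask a random subset of patch x patch cells (hypothetical scheme)."""
    mask = [[0] * width for _ in range(height)]
    cells = [(r, c) for r in range(0, height, patch)
             for c in range(0, width, patch)]
    for r, c in rng.sample(cells, int(len(cells) * ratio)):
        for y in range(r, min(r + patch, height)):
            for x in range(c, min(c + patch, width)):
                mask[y][x] = 1
    return mask


def split_masked_pixels(masked, text):
    """Split masked pixels into non-text and text parts using the box mask."""
    background = [[m & (1 - t) for m, t in zip(m_row, t_row)]
                  for m_row, t_row in zip(masked, text)]
    text_part = [[m & t for m, t in zip(m_row, t_row)]
                 for m_row, t_row in zip(masked, text)]
    return background, text_part
```

Under this reading, a reconstruction loss would apply only where `background` is 1, while pixels where `text_part` is 1 would be supervised by pseudo labels.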
Taxonomy
Topics: Handwritten Text Recognition Techniques
