Vision-Language Pre-Training for Boosting Scene Text Detectors
Sibo Song, Jianqiang Wan, Zhibo Yang, Jun Tang, Wenqing Cheng, Xiang, Bai, Cong Yao

TL;DR
This paper introduces a vision-language pre-training approach that enhances scene text detectors by learning joint representations through contrastive learning, masked language modeling, and word-in-image prediction, significantly improving detection accuracy.
Contribution
It proposes a novel vision-language pre-training architecture with three pretext tasks specifically designed to boost scene text detection performance.
Findings
Significant performance improvements on standard benchmarks.
Outperforms previous pre-training methods.
Enhances existing detectors like EAST and PSENet.
Abstract
Recently, vision-language joint representation learning has proven to be highly effective in various scenarios. In this paper, we specifically adapt vision-language joint learning for scene text detection, a task that intrinsically involves cross-modal interaction between the two modalities: vision and language, since text is the written form of language. Concretely, we propose to learn contextualized, joint representations through vision-language pre-training, for the sake of enhancing the performance of scene text detectors. Towards this end, we devise a pre-training architecture with an image encoder, a text encoder and a cross-modal encoder, as well as three pretext tasks: image-text contrastive learning (ITC), masked language modeling (MLM) and word-in-image prediction (WIP). The pre-trained model is able to produce more informative representations with richer semantics, which…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Natural Language Processing Techniques
MethodsContrastive Learning
