Vision-Language Pre-Training for Boosting Scene Text Detectors

Sibo Song; Jianqiang Wan; Zhibo Yang; Jun Tang; Wenqing Cheng; Xiang Bai; Cong Yao

arXiv:2204.13867 · cs.CV · May 2, 2022 · 1 citation

TL;DR

This paper introduces a vision-language pre-training approach that enhances scene text detectors by learning joint representations through contrastive learning, masked language modeling, and word-in-image prediction, significantly improving detection accuracy.

Contribution

It proposes a novel vision-language pre-training architecture with three pretext tasks specifically designed to boost scene text detection performance.

Findings

01 Significant performance improvements on standard benchmarks.

02 Outperforms previous pre-training methods.

03 Enhances existing detectors like EAST and PSENet.

Abstract

Recently, vision-language joint representation learning has proven to be highly effective in various scenarios. In this paper, we specifically adapt vision-language joint learning for scene text detection, a task that intrinsically involves cross-modal interaction between the two modalities: vision and language, since text is the written form of language. Concretely, we propose to learn contextualized, joint representations through vision-language pre-training, for the sake of enhancing the performance of scene text detectors. Towards this end, we devise a pre-training architecture with an image encoder, a text encoder and a cross-modal encoder, as well as three pretext tasks: image-text contrastive learning (ITC), masked language modeling (MLM) and word-in-image prediction (WIP). The pre-trained model is able to produce more informative representations with richer semantics, which…
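Of the three pretext tasks named in the abstract, image-text contrastive learning (ITC) is the easiest to illustrate in isolation. The sketch below is not the authors' implementation. It shows a generic symmetric contrastive (InfoNCE-style) loss over a batch of paired image and text embeddings, in which the i-th image and the i-th text are treated as the positive pair. The function names, the plain-list embedding format, and the temperature value of 0.07 are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over paired embeddings (hypothetical helper).

    image_embs[i] and text_embs[i] are assumed to describe the same sample.
    The temperature is an illustrative assumption, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Similarity matrix: rows index images, columns index texts.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    sim_transposed = [list(col) for col in zip(*sim)]
    image_to_text = cross_entropy_diag(sim)
    text_to_image = cross_entropy_diag(sim_transposed)
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss:", itc_loss(images, aligned_texts))
    print("swapped loss:", itc_loss(images, swapped_texts))
```

The loss is small when each image embedding is closest to its own text embedding and large when the pairing is wrong, which is the signal a contrastive objective of this kind uses to pull matched image and text representations together.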

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Natural Language Processing Techniques

Methods: Contrastive Learning