Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting

Chuhui Xue; Wenqing Zhang; Yu Hao; Shijian Lu; Philip Torr; Song Bai

arXiv:2203.03911·cs.CV·November 15, 2022

Open Access

TL;DR

This paper introduces oCLIP, a weakly supervised vision-language pre-training approach that enhances scene text detection and spotting by jointly learning visual and textual features, even with limited annotations.

Contribution

The paper presents a novel weakly supervised pre-training method for scene text tasks, enabling effective learning from partial annotations and improving downstream performance.

Findings

01 Improves F-score by +2.5% on scene text detection and +4.8% on scene text spotting.

02 Outperforms existing pre-training methods on multiple datasets.

03 Enables learning from weakly annotated images without bounding boxes.

Abstract

Recently, Vision-Language Pre-training (VLP) techniques have greatly benefited various vision-language tasks by jointly learning visual and textual representations, which intuitively helps in Optical Character Recognition (OCR) tasks due to the rich visual and textual information in scene text images. However, these methods cannot cope well with OCR tasks because of the difficulty in both instance-level text encoding and image-text pair acquisition (i.e., images and the texts captured in them). This paper presents a weakly supervised pre-training method, oCLIP, which can acquire effective scene text representations by jointly learning and aligning visual and textual information. Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features, respectively, as well as a visual-textual decoder that models the interaction among textual and…
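
To make "instance-level" and "character-aware" text encoding concrete, here is a minimal Python sketch. It is not the authors' implementation: the alphabet, the embedding size, the random (untrained) embedding table, the mean pooling, and the names `encode_instance` and `encode_image_texts` are all assumptions made for illustration. It only shows the general idea that each text instance in an image is encoded separately from its own characters, instead of encoding all the text in the image as one sentence.

```python
import random

# Hypothetical settings, chosen only for this illustration.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
DIM = 8

# Random stand-in for a learned character embedding table (assumption).
_rng = random.Random(0)
CHAR_EMB = {c: [_rng.gauss(0.0, 1.0) for _ in range(DIM)] for c in ALPHABET}


def encode_instance(text):
    """Encode one text instance (e.g. one word) from its characters.

    Characters outside ALPHABET are skipped. Mean pooling is a simplifying
    assumption and ignores character order; a trained encoder would
    normally model the character sequence.
    """
    vecs = [CHAR_EMB[c] for c in text.lower() if c in CHAR_EMB]
    if not vecs:
        return [0.0] * DIM
    # Average each embedding dimension over the characters of the instance.
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def encode_image_texts(instances):
    """Return one embedding per text instance found in an image."""
    return [encode_instance(t) for t in instances]


if __name__ == "__main__":
    # Two text instances transcribed from one image; no bounding boxes needed.
    embeddings = encode_image_texts(["STOP", "exit"])
    print(len(embeddings), len(embeddings[0]))
```

Because the encoder works on transcriptions alone, a sketch like this needs only the texts that appear in an image, which matches the weakly annotated setting described in the summary above.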

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques