Turning a CLIP Model into a Scene Text Detector

Wenwen Yu; Yuliang Liu; Wei Hua; Deqiang Jiang; Bo Ren; Xiang Bai

arXiv:2302.14338·cs.CV·March 28, 2023

TL;DR

This paper introduces TCM, a novel approach that leverages the CLIP model directly for scene text detection, enhancing few-shot learning and domain adaptation without additional pretraining.

Contribution

It presents a new method to turn CLIP into a scene text detector, improving existing methods' performance and adaptability with minimal labeled data.

Findings

01. Significant performance boost with 10% labeled data (22% F-measure increase).

02. Enhanced domain adaptation capabilities.

03. Applicable to improve existing scene text detectors.

Abstract

The recent large-scale Contrastive Language-Image Pretraining (CLIP) model has shown great potential in various downstream tasks by leveraging pretrained vision and language knowledge. Scene text, which contains rich textual and visual information, has an inherent connection with a model like CLIP. Recently, pretraining approaches based on vision-language models have made effective progress in the field of text detection. In contrast to these works, this paper proposes a new method, termed TCM, focusing on Turning the CLIP Model directly for text detection without a pretraining process. We demonstrate the advantages of the proposed TCM as follows: (1) The underlying principle of our framework can be applied to improve existing scene text detectors. (2) It facilitates the few-shot training capability of existing methods, e.g., by using 10% of labeled data, we significantly improve…

Tables (15)

Table 1: Text detection results of cooperating with existing methods on IC15, TD, and CTW. † indicates the results from [52]. Reg. and Seg. are short for regression and segmentation methods, respectively. FPS is reported with a ResNet50 backbone on a single V100.

Type	Method	IC15 F	IC15 $Δ$	TD F	TD $Δ$	CTW F	CTW $Δ$	FPS
Reg.	FCENet [60]	86.2	-	85.4^†	-	85.5	-	11.5
Reg.	TCM-FCENet	87.1	+0.9	86.9	+1.5	85.9	+0.4	8.4
Seg.	PAN [39]	82.9	-	84.1	-	83.7	-	36
Seg.	TCM-PAN	84.6	+1.7	85.3	+1.2	84.3	+0.6	18
Seg.	DBNet [18]	87.3	-	84.9	-	83.4	-	14.5
Seg.	TCM-DBNet	89.2	+1.9	88.8	+3.9	84.9	+1.5	10

Table 2: SynthText-to-real adaptation. † indicates the results from [42]. ST indicates SynthText. F-measure (%) is reported.

Method	ST $\to$ IC13	ST $\to$ IC15
EAST^† [58]	67.1	60.5
PAN [39]	-	54.8
CCN [44]	-	65.1
ST3D [16]	73.8	67.6
DBNet [18]	71.7	64.0
TCM-DBNet	79.6	76.7

Table 3: Real-to-real adaptation. † indicates that the results are from [52]. Note that the proposed method outperforms other methods. F-measure (%) is reported.

Method	IC13 $\to$ IC15	IC13 $\to$ TD
EAST^† [58]	53.3	46.8
GD(AD) [52]	64.4	58.5
GD(10-AD)[52]	69.4	62.1
CycleGAN [59]	57.2	-
ST-GAN [19]	57.6	-
CycleGAN+ST-GAN	60.8	-
TST [42]	52.4	-
DBNet [18]	63.9	53.8
TCM-DBNet	71.9	65.1

Table 4: Comparison with existing scene text pretraining techniques on DBNet (DB). † indicates the results from [31]. ST and VLP denote SynthText pretraining and visual-language pretraining methods, respectively. * stands for our reimplementation results. F-measure (%) is reported.

	Methods	Pretext task	IC15	TT	TD	CTW
Convention	SegLink [29]	$\times$	-	-	77.0	-
	PSENet-1s [14]	$\times$	85.7	80.9	-	82.2
	LOMO [53]	$\times$	87.2	81.6	-	78.4
	MOST [8]	$\times$	88.2	-	86.4	-
	Tang et al.[33]	$\times$	89.1	-	88.1	-
VLP	DB+ST^†	$\times$	85.4	84.7	84.9	-
	DB+STKM^† [37]	✓	86.1	85.5	85.9	-
	DB+VLPT^† [31]	✓	86.5	86.3	88.5	-
	DB+oCLIP* [48]	✓	-	-	-	84.4
	DB+TCM(Ours)	$\times$	89.4	85.9	88.8	85.1

Table 5: Ablation study of the ResNet50 backbone on IC15, TD, TT, and CTW. BB indicates Backbone. R50 and CR50 represent the ResNet50 backbones of DBNet and CLIP, respectively. F-measure (%) is reported.

Method	BB	IC15	TD	TT	CTW
DBNet	R50	87.3	84.9	84.7	83.4
DBNet	CR50	87.7 (+0.4)	86.8 (+1.9)	84.7	83.4

Table 6: Ablation study of our proposed components on IC15, TD, TT, and CTW. “BSL”, “PP”, “LP”, “LG”, and “VG” represent the baseline method DBNet, the predefined prompt, the learnable prompt, the language prompt generator, and the visual prompt generator, respectively. F-measure (%) is reported for each dataset. $\Delta$ represents the variance.

Method	PP	LP	LG	VG	IC15	TD	TT	CTW
BSL	$\times$	$\times$	$\times$	$\times$	87.7	86.8	84.7	83.4
BSL+	✓	$\times$	$\times$	$\times$	87.75	87.0	84.74	83.5
BSL+	✓	4	$\times$	$\times$	88.0	87.1	84.8	83.6
BSL+	$\times$	4	$\times$	$\times$	87.8	87.7	85.1	83.9
BSL+	$\times$	18	$\times$	$\times$	88.1	87.8	85.3	83.9
BSL+	$\times$	32	$\times$	$\times$	88.4	88.2	85.4	84.5
BSL+	✓	4	✓	$\times$	88.6	88.4	85.5	84.6
TCM	✓	4	✓	✓	89.2	88.8	85.6	84.9
TCM	✓	32	✓	✓	89.4	88.8	85.9	85.1
$Δ$					+1.7	+2.0	+1.2	+1.7

Table 7: Ablation study of the effect of LG and VG on generalization performance. F-measure (%) is reported.

Method	IC13 $\to$ IC15	IC13 $\to$ TD	IC15 $\to$ MLT17(en)	TT $\to$ ArT(-)	ST $\to$ IC13	ST $\to$ IC15
TCM	71.9	65.1	85.1	68.9	79.5	76.7
w/o VG	68.4 (-3.5)	59.4 (-5.7)	81.8 (-3.3)	59.1 (-9.8)	76.3 (-3.2)	72.6 (-4.1)
w/o LG	66.1 (-5.8)	56.8 (-8.3)	79.7 (-5.4)	57.8 (-11.1)	74.5 (-5.0)	68.2 (-8.5)
w/o VG & LG	64.8 (-7.1)	55.7 (-9.4)	78.4 (-6.7)	54.2 (-14.7)	71.7 (-7.8)	63.9 (-12.8)

Table 8: Ablation study of the image encoder and text encoder. “LR” represents the learning rate.

	Image encoder	Text encoder	F (%)
LR Factor	0.1	0.0	88.7
	0.1	0.1	87.8
	0.1	1.0	87.1
	1.0	1.0	86.3

Table 9: Ablation study on large amounts of training data.

Method	Training Data	Testing Data	F (%)
FCENet	Joint data	NightTime-ArT	55.2
DBNet	Joint data	NightTime-ArT	52.8
TCM-DBNet	Joint data	NightTime-ArT	70.2

Table 10: Ablation study comparing parameters and FLOPs with DBNet.

Method	Backbone	Params	FLOPs	F (%)
DBNet	R50	26 (M)	98 (G)	84.9
DBNet	R101	46 (M)	139 (G)	85.9
DBNet	R152	62 (M)	180 (G)	87.3
TCM-DBNet	R50	50 (M)	156 (G)	88.7

Table 11: Ablation study of the auxiliary loss.

Model	F (%)
TCM-DBNet with auxiliary loss	88.7
TCM-DBNet w/o auxiliary loss	85.1

Table 12: Real-to-real adaptation. F-measure (%) is reported.

Method	MLT17 $\to$ MLT19
DBNet [18]	47.4
TCM-DBNet	67.5

Table 13: Ablation study of different predefined language prompts.

Predefined language prompt	IC15
“Text”	89.2
“A set of arbitrary-shape text instances”	89.0
“The pixels of many arbitrary-shape text instances”	88.9
without predefined language prompt	87.7

Table 14: Ablation study of training TCM-DBNet on IC15 with extra TextOCR data.

Model	Training data	F (%)
TCM-DBNet	IC15	89.2
TCM-DBNet	IC15+TextOCR	90.4

Table 15: Ablation study on the CLIP backbone. R50 means ResNet50.

Model	Backbone	ST $\to$ IC13	ST $\to$ IC15
DBNet	R50	71.7	64.0
DBNet	CLIP-R50	73.1	67.4
TCM-DBNet	CLIP-R50	79.6	76.7

Equations (11)

$$\bm{I}=\operatorname{ImageEncoder}(\bm{I}^{\prime}). \tag{1}$$

$$\bm{t}_{in}^{\prime}=\operatorname{WordEmbedding}(\text{Text})\in\mathbb{R}^{D}, \tag{2}$$

$$\bm{t}_{in}=[\bm{c}_{1},\ldots,\bm{c}_{n},\bm{t}_{in}^{\prime}]\in\mathbb{R}^{(n+1)\times D}. \tag{3}$$

$$\bm{t}_{out}=\operatorname{TextEncoder}(\bm{t}_{in})\in\mathbb{R}^{C}. \tag{4}$$

$$\hat{\bm{t}}_{in}=\bm{cc}+\bm{t}_{in}\in\mathbb{R}^{(n+1)\times D}, \tag{5}$$

$$\bm{cc}=\operatorname{LN}(\sigma(\operatorname{LN}(\bm{\bar{I}})\bm{W}_{1}+\bm{b}_{1}))\bm{W}_{2}+\bm{b}_{2}\in\mathbb{R}^{D}, \tag{6}$$

$$\tilde{\bm{I}}=\operatorname{TDec}(\bm{Q}=\bm{I},\bm{K}=\bm{t}_{out},\bm{V}=\bm{t}_{out})\in\mathbb{R}^{\tilde{H}\times\tilde{W}\times C}, \tag{7}$$

$$\hat{\bm{I}}=\bm{I}+\tilde{\bm{I}}. \tag{8}$$

$$\bm{P}=\operatorname{sigmoid}(\hat{\bm{I}}\bm{t}_{out}^{T}/\tau)\in\mathbb{R}^{\tilde{H}\times\tilde{W}\times 1}, \tag{9}$$

$$\mathcal{L}_{aux}=\sum_{i}^{\tilde{H}}\sum_{j}^{\tilde{W}}y_{ij}\log(P_{ij})+(1-y_{ij})\log(1-P_{ij}), \tag{10}$$

$$\mathcal{L}_{total}=\mathcal{L}_{det}+\lambda\mathcal{L}_{aux}, \tag{11}$$

Code & Models

Repositories

wenwenyu/tcm (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training

Full text

Turning a CLIP Model into a Scene Text Detector

Wenwen Yu∗,1, Yuliang Liu∗,1, Wei Hua1, Deqiang Jiang2, Bo Ren2, Xiang Bai†,1

1Huazhong University of Science and Technology 2Tencent YouTu Lab

{wenwenyu,ylliu,whua_hust,xbai}@hust.edu.cn, {dqiangjiang,timren}@tencent.com

∗Equal contribution. †Corresponding author.

Abstract

The recent large-scale Contrastive Language-Image Pretraining (CLIP) model has shown great potential in various downstream tasks by leveraging pretrained vision and language knowledge. Scene text, which contains rich textual and visual information, has an inherent connection with a model like CLIP. Recently, pretraining approaches based on vision-language models have made effective progress in the field of text detection. In contrast to these works, this paper proposes a new method, termed TCM, focusing on Turning the CLIP Model directly for text detection without a pretraining process. We demonstrate the advantages of the proposed TCM as follows: (1) The underlying principle of our framework can be applied to improve existing scene text detectors. (2) It facilitates the few-shot training capability of existing methods, e.g., by using 10% of labeled data, we significantly improve the performance of the baseline method by an average of 22% in terms of the F-measure on 4 benchmarks. (3) By turning the CLIP model into existing scene text detection methods, we further achieve promising domain adaptation ability. The code will be publicly released at https://github.com/wenwenyu/TCM.

1 Introduction

Scene text detection is a long-standing research topic that aims to localize the bounding box or polygon of each text instance in natural images. It has a wide range of practical application scenarios, such as office automation, instant translation, automatic driving, and online education. With the rapid development of fully-supervised deep learning technologies, scene text detection has achieved remarkable progress. However, supervised approaches require extensive and elaborate annotations, e.g., character-level, word-level, and text-line-level bounding boxes, and especially polygonal boxes for arbitrarily-shaped scene text. Therefore, it is very important to investigate text detection methods that work with a small amount of labeled data, i.e., few-shot training.

Recently, by leveraging pretrained vision and language knowledge, the large-scale Contrastive Language-Image Pretraining (CLIP) model [27] has demonstrated its significance in various downstream tasks, e.g., image classification [56], object detection [5], and semantic segmentation [28, 45, 13].

Compared to general object detection, scene text in natural images usually carries both visual and rich character information, which has a natural connection with the CLIP model. Therefore, how to make full use of cross-modal information from visual, semantic, and text knowledge to improve the performance of text detection models has received increasing attention in recent studies. For example, Song et al. [31], inspired by CLIP, adopt fine-grained cross-modality interaction to align unimodal embeddings and learn better backbone representations via carefully designed pretraining tasks. Xue et al. [48] present a weakly supervised pretraining method that jointly learns and aligns visual and partial textual information to learn effective visual text representations for scene text detection. Wan et al. [37] propose self-attention based text knowledge mining to enhance the backbone via an image-level text recognition pretraining task.

Different from these works, as shown in Figure 1, this paper focuses on turning the CLIP model into a text detector without a pretraining process. However, it is not trivial to incorporate the CLIP model into a scene text detector. The key is to find a proper method to exploit the visual and semantic prior information conditioned on each image. In this paper, we develop a new method for scene text detection, termed TCM, short for Turning a CLIP Model into a scene text detector, which can be easily plugged into existing scene text detection frameworks to improve them. We design a cross-modal interaction mechanism through visual prompt learning, implemented by cross-attention, to recover the locality feature from the image encoder of CLIP. This feature captures fine-grained information that responds to the coarse text region for the subsequent matching between text instance and language. Besides, to steer the pretrained knowledge from the text encoder conditioned independently on different input images, we employ a predefined language prompt, a learnable prompt, and a language prompt generator that applies simple linear layers to global image information. In addition, we design an instance-language matching method to align the image embedding and text embedding, which encourages the image encoder to explicitly refine text regions from cross-modal visual-language priors. Compared to previous pretraining approaches, our method can be directly finetuned for the text detection task without a pretraining process, as elaborated in Fig. 1. In this way, the text detector can absorb the rich visual or semantic information of text from CLIP. We summarize the advantages of our method as follows:

• We construct a new text detection framework, termed TCM, which can be easily plugged into existing detectors to enhance them.

• Our framework enables effective few-shot training. This advantage over the baseline detectors becomes more obvious as fewer training samples are used. Specifically, by using 10% of labeled data, we improve the performance of the baseline detector by an average of 22% in terms of the F-measure on 4 benchmarks.

• TCM introduces promising domain adaptation ability, i.e., when the training data is out-of-distribution with respect to the testing data, the performance can be significantly improved. This phenomenon is further demonstrated on a NightTime-ArT text dataset, which we collected from the ArT dataset.

• Without a pretraining process using specific pretext tasks, TCM can still leverage the prior knowledge from the CLIP model, outperforming previous scene text pretraining methods [37, 31, 48].

2 Related works

Unimodal Scene Text Detection.

Unimodal scene text detection refers to methods that rely only on bounding box annotations [21]. These methods can be roughly divided into two categories: segmentation-based methods and regression-based methods. The segmentation-based methods usually conduct pixel-level [18, 35, 47, 14, 39, 43, 17], segment-level [29, 23, 54, 1, 46, 34, 32, 51], or contour-level [38, 41] segmentation, and then group segments into text instances via post-processing. The regression-based methods [60, 55, 9, 10, 15, 58, 8, 53, 40] regard text as a whole object and regress the bounding boxes of the text instances directly.

Cross-modal Assisted Scene Text Detection.

Unlike unimodal scene text detection, cross-modal assisted scene text detection aims to make full use of cross-modal information, including visual, semantic, and text knowledge, to boost performance. Wan et al. [37] utilized an image-level text recognition pretraining task to enhance the backbone via the proposed self-attention based text knowledge mining mechanism. Song et al. [31], inspired by CLIP, designed three fine-grained cross-modality interaction pretraining tasks to align unimodal embeddings for learning better backbone representations. Xue et al. [48] jointly learned and aligned visual and partial text instance information to learn effective visual text representations via the proposed weakly supervised pretraining method. Long et al. [22] proposed an end-to-end model to perform unified scene text detection and visual layout analysis simultaneously. The above methods explicitly leverage text or visual information to assist text detection. Instead, our method focuses on improving performance by turning a CLIP model into a scene text detector that leverages pretrained text knowledge.

3 Methodology

We begin by illustrating the CLIP model which we used for fetching the prior knowledge. Next, we introduce the technical details of TCM as well as the rationale behind it. An overview of our approach is shown in Fig. 2.

3.1 Contrastive Language-Image Pretraining

CLIP [27], which collects 400 million image-text pairs without human annotation for model pretraining, has demonstrated the potential of learning transferable knowledge and open-set visual concepts. A previous study [4] shows that different neurons in the CLIP model can capture the corresponding concept literally, symbolically, and conceptually. As shown in Fig. 4, the CLIP model is an inborn text-friendly model that can effectively abstract the mapping space between image and text [26]. During training, CLIP learns a joint embedding space for the two modalities via a contrastive loss. Given a batch of image-text pairs, for each image, CLIP maximizes the cosine similarity with the matched text while minimizing that with all other unmatched texts. The loss for each text is computed symmetrically over the images. In this way, CLIP can be used for zero-shot image recognition [56]. However, to exploit the relevant information from such a model, two issues must be addressed: 1) a proper method is needed to effectively request the prior knowledge from CLIP; 2) the original model can only measure the similarity between a whole image and a single word or sentence, whereas in scene text detection there are usually many text instances per image, all of which need to be recalled equally.
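The contrastive objective described above can be illustrated with a minimal NumPy sketch. It is not CLIP's actual training code: the function names, the random embeddings, and the fixed temperature of 0.07 are assumptions made for illustration only.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Unit-normalize so that dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities of B matched pairs.

    temperature=0.07 is an illustrative assumption, not a value from this paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature      # (B, B); entry [i, j] compares image i with text j
    idx = np.arange(logits.shape[0])        # matched pairs lie on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # each image against all texts
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # each text against all images
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))                          # toy embeddings, hypothetical
aligned_loss = clip_contrastive_loss(emb, emb)         # every pair matches itself
shuffled_loss = clip_contrastive_loss(emb, emb[::-1])  # every pair is mismatched
```

In this toy setup the aligned batch yields a lower loss than the mismatched one, which is the behavior the objective is designed to reward.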

3.2 Turning a CLIP into a Text Detector

To turn the CLIP model into a scene text detector, we propose TCM, as shown in Fig. 2 and Fig. 3. TCM is a pluggable module that can be directly applied to enhance existing scene text detectors. It extracts the image and text embeddings from the image encoder and text encoder of the CLIP model, respectively. We then design a cross-modal interaction mechanism through visual prompt learning to recover the locality feature from the image encoder of CLIP, which can capture fine-grained information to respond to the coarse text region for the subsequent matching between text instance and language. To better steer the pretrained knowledge, we introduce a language prompt generator that generates a conditional cue for each image, and we design a visual prompt generator that learns image prompts for adapting the frozen CLIP text encoder to the text detection task. TCM can be applied to a broad range of text detection methods with only minor modifications. In addition, we design an instance-language matching method to align the image embedding and text embedding, which encourages the image encoder to explicitly refine text regions from cross-modal visual-language priors.

Image Encoder.

We use the pretrained ResNet50 [7] of CLIP as the image encoder, which produces an embedding vector for every input pixel. Given the input image $\bm{I}^{\prime}\in\mathbb{R}^{H\times W\times 3}$ , the image encoder outputs the image embedding $\bm{I}\in\mathbb{R}^{\tilde{H}\times\tilde{W}\times C}$ , where $\tilde{H}=\frac{H}{s}$ , $\tilde{W}=\frac{W}{s}$ , $C$ is the image embedding dimension (set to 1024), and $s$ is the downsampling ratio (empirically set to 32), which can be expressed as:

$$\bm{I}=\operatorname{ImageEncoder}(\bm{I}^{\prime}). \tag{1}$$

Text Encoder.

The text encoder takes $K$ class prompts as input and embeds them into a continuous vector space $\mathbb{R}^{C}$ , producing text embeddings $\bm{T}=\{\bm{t}_{1},\ldots,\bm{t}_{K}\}\in\mathbb{R}^{K\times C}$ as outputs, where $\bm{t}_{i}\in\mathbb{R}^{C}$ . Specifically, we keep the pretrained text encoder of CLIP frozen throughout, as the text encoder can provide a language knowledge prior for text detection. $K$ is set to 1 because there is only one text class in the text detection task. Different from the original model, which uses templates like “a photo of a [CLS].”, we predefine the discrete language prompt as “Text”. Then, a part of the text encoder input $\bm{t}_{in}^{\prime}$ is defined as follows:

$$\bm{t}_{in}^{\prime}=\operatorname{WordEmbedding}(\text{Text})\in\mathbb{R}^{D}, \tag{2}$$

where $\operatorname{WordEmbedding}(\cdot)$ denotes word embedding for predefined prompt “Text” class. $D$ is the word embedding dimension and set to 512.

Inspired by CoOp [57, 56], we also add learnable prompts $\{\bm{c}_{1},\ldots,\bm{c}_{n}\}$ to learn robust transferability of the text embedding and facilitate zero-shot transfer of the CLIP model, where $n$ is the number of learnable prompts, set to 4 by default, and $\bm{c}_{i}\in\mathbb{R}^{D}$ . Thus, the input $\bm{t}_{in}$ of the text encoder is as follows:

$$\bm{t}_{in}=[\bm{c}_{1},\ldots,\bm{c}_{n},\bm{t}_{in}^{\prime}]\in\mathbb{R}^{(n+1)\times D}. \tag{3}$$

The text encoder takes $\bm{t}_{in}$ as input and generates the text embedding $\bm{T}=\{\bm{t}_{1}\}\in\mathbb{R}^{C}$ , and $\bm{T}$ is denoted by $\bm{t}_{out}\in\mathbb{R}^{C}$ for simplicity:

$$\bm{t}_{out}=\operatorname{TextEncoder}(\bm{t}_{in})\in\mathbb{R}^{C}. \tag{4}$$
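As a shape-level illustration of how the text encoder input is assembled from the predefined prompt embedding and the $n$ learnable prompts, consider the following NumPy sketch. The random arrays stand in for the real word embedding of “Text” and for trained prompt vectors, and the 0.02 initialization scale is a hypothetical choice rather than a value from the paper.

```python
import numpy as np

D = 512  # word embedding dimension (stated in the text)
n = 4    # number of learnable prompts (default stated in the text)

rng = np.random.default_rng(0)
# Stand-in for WordEmbedding("Text"); a real model would look this up in its embedding table.
t_in_prime = rng.normal(size=(D,))
# Stand-in for the learnable prompts c_1..c_n; the 0.02 scale is a hypothetical initializer.
context = rng.normal(scale=0.02, size=(n, D))

# Prepend the learnable prompts to the predefined prompt embedding, giving (n + 1, D).
t_in = np.concatenate([context, t_in_prime[None, :]], axis=0)
```

Only the learnable prompts would receive gradients during training; the word embedding and the text encoder itself stay frozen according to the description above.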

Language Prompt Generator.

Although the predefined prompt and learnable prompt are effective for steering the CLIP model, they may suffer from limited few-shot or generalization ability in open-ended scenarios where the testing text instances are out-of-distribution with respect to the training images. To this end, we present a language prompt generator to generate a feature vector, termed the conditional cue ( $\bm{cc}$ ). For each image, the $\bm{cc}$ is then combined with the input of the text encoder $\bm{t}_{in}$ , formulated as follows:

$$\hat{\bm{t}}_{in}=\bm{cc}+\bm{t}_{in}\in\mathbb{R}^{(n+1)\times D}, \tag{5}$$

where $\hat{\bm{t}}_{in}$ is the new prompt input of the text encoder conditioned on the input image, and we replace $\bm{t}_{in}$ with $\hat{\bm{t}}_{in}$ in Eq. 4.

In practice, the language prompt generator is built as a two-layer feed-forward network that generates the conditional cue ( $\bm{cc}$ ) from the global image-level feature of the image embedding $\bm{I}$ . It consists of two layer normalizations, each followed by a linear transformation, with a ReLU activation in between, which is formulated as follows:

$$\bm{cc}=\operatorname{LN}(\sigma(\operatorname{LN}(\bm{\bar{I}})\bm{W}_{1}+\bm{b}_{1}))\bm{W}_{2}+\bm{b}_{2}\in\mathbb{R}^{D}, \tag{6}$$

where $\bm{\bar{I}}\in\mathbb{R}^{C}$ is the global image-level feature generated from image embedding $\bm{I}$ by the same global attention pooling layer as in CLIP. $\bm{W}_{1}\in\mathbb{R}^{C\times C}$ , $\bm{W}_{2}\in\mathbb{R}^{C\times D}$ , $\bm{b}_{1}\in\mathbb{R}^{C}$ , $\bm{b}_{2}\in\mathbb{R}^{D}$ , and we broadcast $\bm{cc}$ with $\bm{t}_{in}$ to get $\hat{\bm{t}}_{in}$ in Eq. 5.
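The language prompt generator and the broadcast addition of the conditional cue can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: layer normalization is written without learnable scale and shift, the weights are random stand-ins, and the helper names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain layer normalization over the last axis (no learnable affine; a simplification).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def language_prompt_generator(I_bar, W1, b1, W2, b2):
    # cc = LN(ReLU(LN(I_bar) W1 + b1)) W2 + b2
    hidden = np.maximum(layer_norm(I_bar) @ W1 + b1, 0.0)
    return layer_norm(hidden) @ W2 + b2

C, D, n = 1024, 512, 4                      # dimensions stated in the text
rng = np.random.default_rng(0)
I_bar = rng.normal(size=(C,))               # stand-in for the pooled global image feature
W1 = rng.normal(scale=0.02, size=(C, C))    # random stand-ins for trained weights
b1 = np.zeros(C)
W2 = rng.normal(scale=0.02, size=(C, D))
b2 = np.zeros(D)
t_in = rng.normal(size=(n + 1, D))          # stand-in for the text encoder input

cc = language_prompt_generator(I_bar, W1, b1, W2, b2)  # shape (D,)
t_in_hat = cc + t_in                        # cc is broadcast over the n + 1 prompt tokens
```

Because the same cue vector is added to every prompt token, the text encoder input becomes conditioned on the image without changing its shape.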

Visual Prompt Generator.

We design a visual prompt generator to adaptively propagate fine-grained semantic information from textual features to visual features. Formally, we use the cross-attention mechanism in Transformer [36] to model the interactions between image embedding ( $\bm{Q}$ ) and text embedding ( $\bm{K}$ , $\bm{V}$ ). The visual prompt $\tilde{\bm{I}}$ is then learned for transferring the information prior from image-level to text instance-level, which is defined as:

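$$\tilde{\bm{I}}=\operatorname{TDec}(\bm{Q}=\bm{I},\bm{K}=\bm{t}_{out},\bm{V}=\bm{t}_{out})\in\mathbb{R}^{\tilde{H}\times\tilde{W}\times C}, \tag{7}$$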

where TDec denotes the Transformer Decoder.

Based on the conditional visual prompt, the original image embedding $\bm{I}$ is equipped with $\tilde{\bm{I}}$ to produce the prompted text-aware locality embeddings $\hat{\bm{I}}$ used for instance-language matching (Eq. 9) and downstream detection head:

$$\hat{\bm{I}}=\bm{I}+\tilde{\bm{I}}. \tag{8}$$
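To make the data flow of the visual prompt generator concrete, the sketch below reduces the Transformer decoder to a single-head cross-attention layer followed by the residual addition. This is a deliberate simplification and an assumption on our part: the real TDec stacks several decoder layers with multiple heads and feed-forward blocks, and the toy dimensions and projection matrices here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(I, t_out, Wq, Wk, Wv, Wo):
    """Single-head cross-attention: image positions are queries, text embeddings are keys/values."""
    H, W, C = I.shape
    q = I.reshape(H * W, C) @ Wq                              # (HW, d)
    k = t_out @ Wk                                            # (K, d)
    v = t_out @ Wv                                            # (K, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (HW, K)
    return ((attn @ v) @ Wo).reshape(H, W, C)

H, W, C, d, K = 4, 4, 16, 8, 1      # toy sizes; K = 1 because there is a single text class
rng = np.random.default_rng(0)
I = rng.normal(size=(H, W, C))      # stand-in for the image embedding
t_out = rng.normal(size=(K, C))     # stand-in for the text embedding
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(C, d)) for _ in range(3))
Wo = rng.normal(scale=0.1, size=(d, C))

I_tilde = cross_attention(I, t_out, Wq, Wk, Wv, Wo)  # visual prompt (simplified TDec)
I_hat = I + I_tilde                                   # residual addition
```

With a single text token the softmax weights in this toy layer are trivially 1, so every position receives the same projected text vector; the sketch is only meant to show shapes and the residual connection.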

Instance-language Matching.

Given the output of the text encoder and image encoder, we perform text instance-language matching alignment on text-aware locality image embedding $\hat{\bm{I}}$ and text embedding $\bm{t}_{out}$ by the dot product followed by sigmoid activation to get binary score map. The mixture of the generated conditional fine-grained embedding $\tilde{\bm{I}}$ and visual embedding $\bm{I}$ can allow text instance existing in visual features to be better matched with pretrained language knowledge in collaboration. The matching mechanism is formulated as follows:

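$$\bm{P}=\operatorname{sigmoid}(\hat{\bm{I}}\bm{t}_{out}^{T}/\tau)\in\mathbb{R}^{\tilde{H}\times\tilde{W}\times 1}, \tag{9}$$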

where $\bm{t}_{out}$ is the single text embedding, since there is only one text class in text detection scenarios, and $\bm{P}$ is the binary text segmentation map. The segmentation map is supervised by the ground truth through an auxiliary loss and is concatenated with the prompted embedding $\hat{\bm{I}}$ as input to the downstream text detection head, explicitly incorporating language priors for detection. During training, we minimize a binary cross-entropy loss between the segmentation map $\bm{P}$ and ground-truth, which is defined as follows:

$$\mathcal{L}_{aux}=\sum_{i}^{\tilde{H}}\sum_{j}^{\tilde{W}}y_{ij}\log(P_{ij})+(1-y_{ij})\log(1-P_{ij}), \tag{10}$$

where $y_{ij}$ and $P_{ij}$ are the label and predicted probability of pixel $(i,j)$ belonging to the text instances, respectively.
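The matching step and its supervision can be illustrated with a small NumPy sketch. Two points are assumptions rather than details from the paper: the temperature value passed in the example is arbitrary, and the loss is written in the usual negated form of binary cross-entropy so that a lower value means a better match.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def instance_language_matching(I_hat, t_out, tau):
    # Dot product between every pixel embedding and the text embedding, scaled by tau.
    return sigmoid(I_hat @ t_out / tau)     # (H, W) score map

def bce_loss(P, y, eps=1e-7):
    # Negated log-likelihood summed over pixels (sign convention is an assumption).
    P = np.clip(P, eps, 1.0 - eps)
    return -np.sum(y * np.log(P) + (1.0 - y) * np.log(1.0 - P))

t_out = np.array([1.0, 0.0, 0.0, 0.0])       # toy text embedding, hypothetical
y = np.array([[1.0, 0.0], [0.0, 1.0]])       # toy ground-truth text mask
# Toy pixel embeddings: aligned with t_out on text pixels, opposed elsewhere.
I_hat = np.where(y[..., None] == 1.0, t_out, -t_out)

P = instance_language_matching(I_hat, t_out, tau=1.0)  # tau=1.0 is an arbitrary choice
good = bce_loss(P, y)          # loss against the correct mask
bad = bce_loss(P, 1.0 - y)     # loss against the inverted mask
```

Pixels whose embeddings align with the text embedding receive scores above 0.5, so the loss against the correct mask is lower than against the inverted one.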

Optimization.

The loss function $\mathcal{L}_{total}$ is the sum of detection loss $\mathcal{L}_{det}$ and auxiliary loss $\mathcal{L}_{aux}$ , formulated as follows:

$$\mathcal{L}_{total}=\mathcal{L}_{det}+\lambda\mathcal{L}_{aux}, \tag{11}$$

where $\lambda$ is a trade-off hyper-parameter, set to 1 in this paper. $\mathcal{L}_{det}$ depends on the downstream text detection method, which may be segmentation-based or regression-based. At inference time, we use the output of the detection head as the final result.

4 Experiments

We conduct four sets of experiments to validate TCM. The first set examines how TCM can be incorporated into existing text detectors to achieve consistent performance improvements. Next, we demonstrate the few-shot training capability and generalization ability gained by incorporating TCM. In the third set of experiments, we compare our method with previous pretraining methods. Finally, we provide thorough experiments to evaluate the sensitivity w.r.t. the proposed designs.

Datasets.

Our experiments are conducted on a number of commonly used scene text detection benchmarks, including ICDAR2013 (IC13) [12], ICDAR2015 (IC15) [11], MSRA-TD500 (TD) [50], CTW1500 (CTW) [20], Total-Text (TT) [3], ArT [2], MLT17 [25], and MLT19 [24]. More details of the datasets are given in the appendix.

Evaluation Metric.

We use intersection over union (IoU) to determine whether the model correctly detects the region of text, and we calculate precision (P), recall (R), and F-measure (F) for comparison following common practice [12]. For fair comparisons, text regions labeled with either “do not care” or “###” will be ignored in all datasets during training and testing.
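The evaluation protocol can be sketched in plain Python as follows. For brevity the sketch assumes axis-aligned boxes, an IoU threshold of 0.5, and greedy one-to-one matching; the benchmarks themselves use quadrilateral or polygon annotations and their own official matching scripts, so the function names and the threshold here are illustrative assumptions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def detection_prf(preds, gts, iou_thresh=0.5):
    """Precision, recall, and F-measure with greedy one-to-one matching (assumed protocol)."""
    matched_gt = set()
    tp = 0
    for p in preds:
        for gi, g in enumerate(gts):
            if gi not in matched_gt and box_iou(p, g) >= iou_thresh:
                matched_gt.add(gi)   # each ground-truth box can be matched only once
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f_measure

# Toy example: two ground-truth boxes, three predictions (one false positive).
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0, 0, 10, 9), (50, 50, 60, 60), (21, 20, 30, 30)]
precision, recall, f_measure = detection_prf(preds, gts)
```

In this toy case two of the three predictions match a ground-truth box, giving a precision of 2/3, a recall of 1.0, and an F-measure of 0.8.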

Implementation Details.

For text detection tasks, we experiment with popular text detection methods including DBNet (DB) [18] (https://github.com/MhLiao/DB), PAN [39] (https://github.com/whai362/pan_pp.pytorch), and FCENet (FCE) [60] (https://github.com/open-mmlab/mmocr/tree/main/configs/textdet/fcenet) to evaluate TCM. For consistent settings with these methods, we train the detector using both SynthText and the real datasets. Specifically, the backbone is instantiated with the pretrained image encoder ResNet50 [7] of CLIP unless specified otherwise. The visual prompt generator has 3 transformer decoder layers with 4 heads; the transformer width is 256, and the feed-forward hidden dimension is set to 1024. We use the corresponding detection head of DBNet, PAN, and FCENet to predict the final results. To test the few-shot learning ability of the model, we directly train on each benchmark with different proportions of training data without pretraining and test on the corresponding test data. To test the generalization ability, we train the model on the source dataset and evaluate it on a target dataset that has a dissimilar distribution. We consider two kinds of adaptation, synthtext-to-real and real-to-real, to validate the domain adaptation of TCM. The ablation studies are conducted w.r.t. the predefined prompt, the learnable prompt, the language prompt generator, the visual prompt generator, and the different settings. DBNet is used as the baseline for TCM.

4.1 Cooperation with Existing Methods

We report the text detection results of TCM combined with three text detection methods on IC15, TD, and CTW in Table 1. Our method is +0.9%, +1.7%, and +1.9% higher than the original FCENet, PAN, and DBNet, respectively, in terms of F-measure on IC15. TD and CTW show similarly consistent improvements. Note that the inference speed of our method is 18, 8.4, and 10 FPS with PAN, FCENet, and DBNet, respectively, evaluated on the IC15, TD, and CTW datasets, retaining the high efficiency of the detector.

We visualize our method in Fig. 7. It shows that the fine-grained features $\tilde{\bm{I}}$ containing text information are recovered from the global image embedding $\bm{I}$ , demonstrating that TCM can identify text regions and provide these prior cues for downstream text detection.

4.2 Few-shot Training Ability

To further verify the few-shot training ability of our method, we directly train our model on real datasets using various training data ratios without pretraining, and evaluate it on the corresponding 4 benchmarks. As shown in Fig. 5, our method is robust on limited data and outperforms the three baseline methods, DB, PAN, and EAST [58]. The results show that TCM can capture the inherent characteristics of text by leveraging the pretrained vision and language knowledge of the zero-shot trained CLIP model.

4.3 Generalization Ability

We conduct two types of experiments, synthtext-to-real adaptation and real-to-real adaptation, as shown in Table 2 and Table 3, respectively. From the tables, we can see that by plugging TCM into DBNet, we significantly improve the performance by an average of 8.2% in terms of F-measure over four different settings covering synthtext-to-real and real-to-real adaptation, which further demonstrates the effectiveness of our method for domain adaptation.

4.4 Comparison with Pretraining Methods

Pretraining methods based on specifically designed pretext tasks have made effective progress in the field of text detection. In contrast to these efforts, TCM can turn the CLIP model directly into a scene text detector without a pretraining process. The comparison results are shown in Table 4, from which we can see that without pretext tasks for pretraining, DB+TCM consistently outperforms previous methods including DB+STKM [37], DB+VLPT [31], and DB+oCLIP [48]. On IC15 in particular, our method outperforms the previous state-of-the-art pretraining method by a large margin, with 89.4% versus 86.5% in terms of the F-measure.

4.5 Ablation Studies

Pretrained CLIP Backbone. First, we conduct experiments in which we only replace the original backbone of DBNet with the pretrained image encoder ResNet50 of CLIP, to quantify the performance difference between the backbones. As shown in Table 5, replacing the backbone alone brings only limited gains (+0.4% on IC15 and +1.9% on TD, with no change on TT and CTW), so the original pretrained model of CLIP is by itself insufficient for leveraging the visual-language knowledge of CLIP. Therefore, it is necessary to use a proper method to excavate the knowledge of the CLIP model.

Ablation Study for the Predefined Prompt. When using the predefined prompt, as illustrated in the second row of Table 6, the performances are slightly improved on all four datasets (IC15, TD, TT, and CTW), with 0.05%, 0.2%, 0.04%, and 0.1% higher than the baseline method, respectively.

Ablation Study for the Learnable Prompt. Results of combining the learnable prompt with the predefined prompt on four datasets are provided in the third row of Table 6. We notice that a consistent improvement can be achieved by adding the learnable prompt. We also show the influence of using different numbers of learnable prompts in rows 4 to 6 of Table 6. We observe that as the number of learnable prompts increases, the performance increases gradually on all datasets. Compared to the value 4, the value 32 obtains obvious improvements on CTW, TD, and TT. We conjecture that this is because a larger number of learnable prompts can better steer the pretrained text encoder knowledge that is useful for text detection. In the following experiments, the default number of learnable prompts is set to 4 for simplicity.

Ablation Study for the Language Prompt Generator. Furthermore, we evaluate the performance of the proposed language prompt generator, shown in the 7th row of Table 6. With the help of the language prompt generator, TCM achieves further improvements on all four datasets, especially on ICDAR2015, indicating that the conditional cue generated by the language prompt generator for each image can ensure better generalization over different types of datasets.

Ablation Study for the Visual Prompt Generator. Finally, combining the proposed visual prompt generator with the other components above improves the F-measure over the baseline on all four datasets, with larger margins of 1.7% and 2.0% on IC15 and TD, respectively. The reason for this complementary effect is that the visual prompt generator can propagate fine-grained visual semantic information from textual features to visual features. In addition, the prompted locality image embedding generated by the visual prompt generator can guide the model to obtain more accurate text instance-level visual representations, which strengthens instance-language matching and produces a precise segmentation score map that is useful for the downstream detection head.

Ablation Study for the VG and LG on Generalization Performance. As described in Table 7, removing the VG and LG elements from TCM dramatically deteriorates the generalization performance, which further indicates the effectiveness of the VG and LG.

Ablation Study for Image Encoder and Text Encoder. We investigate how the quality of the frozen text encoder and image encoder affects performance by adjusting the corresponding learning rate (LR) factor. The experimental results of TCM-DBNet on the TD500 dataset are shown in Table 8. The results show that using a lower learning rate for both encoders and fixing the text encoder is the optimal setting for training the whole model. Note that we observe performance degradation when directly using a $1.0\times$ learning rate for both encoders, which suggests that the frozen text encoder can stabilize the training process. The core components of the architecture, including the language prompt generator and visual prompt generator, are designed to better steer the knowledge of the pretrained CLIP. Appropriate design of the network architecture and the use of the pretrained CLIP are complementary.
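A per-module LR factor of this kind is commonly implemented by splitting the model parameters into groups, each with its own scaled learning rate. The following pure-Python sketch shows one way to build such groups; the function name, the module name prefixes, the base learning rate, and the factor values are hypothetical and are not the settings reported in Table 8. A factor of 0.0 is treated here as "frozen", an assumption made for illustration.

```python
def build_param_groups(named_params, base_lr, lr_factors):
    """Group parameters by module-name prefix and scale the base LR per group.

    named_params: iterable of (name, parameter) pairs.
    lr_factors:   mapping from name prefix to LR multiplier; a factor of 0.0
                  means the matching parameters are frozen and left out.
    Parameters matching no prefix go to a "default" group with factor 1.0.
    """
    groups = {}
    for name, param in named_params:
        prefix = next((p for p in lr_factors if name.startswith(p)), None)
        factor = lr_factors[prefix] if prefix is not None else 1.0
        if factor == 0.0:
            continue  # frozen: not handed to the optimizer at all
        key = prefix if prefix is not None else "default"
        group = groups.setdefault(
            key, {"name": key, "params": [], "lr": base_lr * factor}
        )
        group["params"].append(param)
    return list(groups.values())


# Hypothetical parameter names; strings stand in for real tensors.
toy_params = [
    ("image_encoder.conv1.weight", "w_img"),
    ("text_encoder.ln_final.weight", "w_txt"),
    ("det_head.conv.weight", "w_head"),
]
toy_groups = build_param_groups(
    toy_params, base_lr=0.01, lr_factors={"image_encoder": 0.1, "text_encoder": 0.0}
)
```

In this toy configuration the image encoder is fine-tuned at one tenth of the base rate, the text encoder is excluded from optimization, and the detection head trains at the full base rate.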

Ablation Study for Different Amount of Data. To explore whether TCM can learn additional knowledge that is hard to obtain simply by increasing the amount of data, we trained the model on a large-scale public joint dataset comprising IC13, IC15, TD, CTW, TT, and MLT17, with 13,784 images in total, and tested it on NightTime-ArT (326 images), a set carefully collected from ArT. Nighttime examples from ArT are shown in Fig. 6, and results are shown in Table 9. Even with the addition of large amounts of training data, existing methods remain limited on nighttime data, which is clearly out-of-distribution with respect to the training set. In contrast, TCM still performs robustly in this case, suggesting a generalization ability that additional training data alone does not provide.

Ablation Study for the Parameters Comparison. For a fair comparison, we increased the number of parameters of DBNet by replacing its backbone with a larger ResNet and conducted experiments on the TD500 dataset. Trainable parameters and FLOPs are calculated with an input size of 1280 $\times$ 800. Results are shown in Table 10. TCM-DBNet achieves better performance than DBNet with a smaller model size and lower computation overhead, demonstrating its effectiveness for scene text detection.
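To make the notion of parameter and FLOP counting at a fixed input size concrete, the sketch below counts the trainable parameters and multiply-accumulate operations (MACs) of a single convolution layer. It is an illustrative helper, not the tool used to produce Table 10; the function name is hypothetical, the example uses the standard ResNet 7 $\times$ 7 stem rather than any specific layer of TCM-DBNet, and treating 800 as the height is an assumption. Conventions also differ on whether one MAC counts as one or two FLOPs.

```python
def conv2d_cost(in_ch, out_ch, kernel, stride, padding, height, width, bias=False):
    """Return (trainable parameters, MACs) for one square-kernel conv layer."""
    # Standard output-size formula for a convolution without dilation.
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    params = out_ch * in_ch * kernel * kernel + (out_ch if bias else 0)
    # One multiply-accumulate per weight per output position (bias adds ignored).
    macs = out_ch * in_ch * kernel * kernel * out_h * out_w
    return params, macs


# Standard ResNet stem: 3 -> 64 channels, 7x7 kernel, stride 2, padding 3, no bias.
stem_params, stem_macs = conv2d_cost(3, 64, 7, 2, 3, height=800, width=1280)
```

Note that the parameter count is independent of the input size, whereas the MAC count scales with the output resolution, which is why an input size has to be stated when computation is reported.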

Ablation Study for the Auxiliary Loss. We further compare the results with and without the auxiliary loss on the TD500 dataset, as shown in Table 11. Using the auxiliary loss achieves higher performance. The results indicate that the auxiliary loss benefits training by imposing constraints on the instance-language matching score map. In addition, the performance improvement suggests that it might help the image encoder of the pretrained CLIP perceive local text regions effectively.

5 Discussion of Failure Cases

Some insightful failure cases are shown in Fig. 8. The instance-language matching score map produces false positive regions whose characteristics are very similar to those of text, as shown in the red circle in Fig. 8; these regions act as noise. Therefore, the downstream text detection head needs to further refine this initial score map rather than directly using the instance-language matching score map as the final result. We leave alleviating false positives in the instance-language matching score map as future work.

6 Conclusion

This paper proposes TCM, which directly excavates prior knowledge from the CLIP model for a scene text detector without a pretraining process. This new text detection paradigm reveals the importance of using visual-language priors to seek information from the zero-shot, off-the-shelf model, thereby guiding the text detector to adapt to small-scale data, divergent data distributions, and complicated scenes without relying on carefully designed pretraining tasks. Experiments comprehensively demonstrate the effectiveness of our method. It is worth mentioning that we also construct a NightTime-ArT dataset to further demonstrate that TCM can steer useful prior knowledge from the CLIP model. As the CLIP model is an inherently text-friendly framework, extending TCM to scene text spotting is also a promising direction for future work.

Acknowledgements This work was supported by the National Natural Science Foundation of China (No.62225603, No.6220073278, No.62206103), and the National Key Research and Development Program (No.2022YFC3301703, No.2022YFC2305102).

Appendix A Appendix

A.1 Datasets

ICDAR2013 [12] is a high-resolution English dataset for focused scene text detection, including 229 images for training and 233 images for testing.

ICDAR2015 [11] is a multi-oriented text detection dataset for English text that includes 1,000 training images and 500 testing images. Scene text images in this dataset were taken by Google Glasses without taking care of positioning, image quality, and viewpoint.

MSRA-TD500 [50] is a multi-language dataset covering English and Chinese, with 300 training images and 200 testing images. We also include an extra 400 training images from HUST-TR400 [49], following previous methods [18, 58].

CTW1500 [20] consists of 1,000 training images and 500 testing images and focuses on curved text. The text instances are annotated at the text-line level by polygons with 14 vertices.

Total-Text [3] contains 1,255 training images and 300 testing images. The text instances are labeled at the word level. It includes horizontal, multi-oriented, and curved text shapes.

ArT [2] includes 5,603 training images and 4,563 testing images. It is a large-scale multi-lingual arbitrary-shape scene text detection dataset. The text regions are annotated by the polygons with an adaptive number of key points. Note that it contains Total-Text and CTW1500.

MLT17 [25] includes text in 9 languages representing 6 different scripts, annotated with quadrangles. It has 7,200 training images, 1,800 validation images, and 9,000 testing images. We use both the training set and the validation set in the finetuning period.

MLT19 [24] is a large-scale multi-lingual scene text detection dataset. It contains 10,000 training images and 10,000 testing images, and is labeled at the word level.

SynthText [6] contains 800k synthetic images generated by blending natural images with artificial text, all of which are annotated at the word level.

TextOCR [30] is a large-scale, high-quality scene text dataset collected from Open Images (https://storage.googleapis.com/openimages/web/index.html). It contains 30 words on average per image. It has 24,902 training images and 3,232 testing images, and is annotated with polygons.

A.2 More Quantitative Results

Multi-lingual Real-to-real Adaptation. We conducted multi-lingual generalization experiments, as shown in Table 12. The results show that the pluggable TCM can also benefit text detection in multi-lingual scenarios by leveraging the pretrained knowledge of CLIP, which demonstrates the effectiveness of our method for domain adaptation.

Ablation Study for the Different Predefined Language Prompt. We conducted an ablation study on the predefined language prompt with different strings using TCM-DBNet, as shown in Table 13. The results show that performance is harmed without the predefined language prompt. In addition, there is little performance variation across different predefined language prompts.

Ablation Study for Training with Large-scale Dataset. We conducted an ablation study of training TCM-DBNet on IC15 with extra TextOCR [30] data. As shown in Table 14, when using the additional large-scale TextOCR data for training, our model achieves further improvement, suggesting that our method is compatible with large-scale datasets.

Ablation Study for CLIP Backbone Generalization. We conducted an ablation study to investigate the generalization performance of DBNet when its backbone is directly replaced with the CLIP backbone, as shown in Table 15. The CLIP-R50 backbone does bring a benefit for generalization. However, when integrated with TCM, the performance improves significantly further. This suggests that directly using the pretrained CLIP-R50 is not strong enough to improve the generalization performance of the existing text detector, which further indicates that synergistic interaction between the detector and CLIP is important.

A.3 More Visualization Results

Conditional Cue. We visualize the t-SNE of the generated conditional cue ( $\mathbf{cc}$ ) on six datasets, as illustrated in Fig. 9. The structured distribution indicates our model has learned the distribution of every domain dataset in high-dimensional feature space, which is useful for improving the generalization ability.

Visual Prompt. Figs. 10 to 13 show more qualitative results of the image embedding $\bm{I}$ and the generated visual prompt $\tilde{\bm{I}}$ on CTW1500, Total-Text, MSRA-TD500, and ICDAR2015, respectively. The visual prompt $\tilde{\bm{I}}$ contains fine-grained information of text regions.

Bibliography

[1] Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. Character region awareness for text detection. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9357–9366, 2019.
[2] Chee-Kheng Chng, Yuliang Liu, Yipeng Sun, Chun Chet Ng, Canjie Luo, Zihan Ni, Chuan Ming Fang, Shuaitao Zhang, Junyu Han, Errui Ding, et al. ICDAR 2019 Robust Reading Challenge on Arbitrary-Shaped Text (RRC-ArT). In ICDAR, pages 1571–1576, 2019.
[3] Chee-Kheng Ch’ng, Chee Seng Chan, and Cheng-Lin Liu. Total-text: toward orientation robustness in scene text detection. IJDAR, pages 1–22, 2019.
[4] Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal neurons in artificial neural networks. Distill, 6(3):e30, 2021.
[5] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, pages 1–20, 2022.
[6] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2315–2324, 2016.
[7] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[8] Minghang He, Minghui Liao, Zhibo Yang, Humen Zhong, Jun Tang, Wenqing Cheng, Cong Yao, Yongpan Wang, and Xiang Bai. MOST: A multi-oriented scene text detector with localization refinement. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8809–8818, 2021.