Decoder Pre-Training with only Text for Scene Text Recognition

Shuai Zhao; Yongkun Du; Zhineng Chen; Yu-Gang Jiang

arXiv:2408.05706·cs.CV·August 13, 2024

TL;DR

This paper introduces DPTR, a pre-training method for scene text recognition (STR) that pre-trains the decoder on text alone, using CLIP text embeddings in place of image features instead of relying on synthetic images.

Contribution

DPTR uses CLIP text embeddings as pseudo visual features and introduces the Offline Randomized Perturbation (ORP) and FMU strategies, enabling decoder pre-training for STR from text alone.

Findings

1. DPTR outperforms existing methods on multiple benchmarks.
2. The approach reduces reliance on synthetic datasets.
3. It demonstrates broad applicability across various decoders.

Abstract

Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets. However, the domain gap between synthetic and real images makes it difficult to acquire feature representations that align well with images of real scenes, thereby limiting the performance of these methods. We note that vision-language models like CLIP, pre-trained on extensive real image-text pairs, effectively align images and text in a unified embedding space, suggesting the potential to derive the representations of real images from text alone. Building upon this premise, we introduce a novel method named Decoder Pre-training with only text for STR (DPTR). DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder. An Offline Randomized Perturbation (ORP) strategy is introduced. It…
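The core idea in the abstract, feeding perturbed text embeddings to a decoder as if they were visual features, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the random unit vectors stand in for CLIP text-encoder outputs, Gaussian noise stands in for the perturbation (the excerpt does not say how ORP perturbs the embeddings), and a linear softmax classifier stands in for an STR decoder. The names `perturb_embeddings` and `decoder_step` are hypothetical and do not come from the paper or the OpenOCR code.

```python
import numpy as np


def perturb_embeddings(text_emb, noise_scale=0.1, seed=0):
    """Add Gaussian noise to unit-norm embeddings, then renormalize.

    Assumption: Gaussian noise is only a stand-in; the excerpt does not
    specify the form of the perturbation used by ORP.
    """
    rng = np.random.default_rng(seed)
    noisy = text_emb + noise_scale * rng.standard_normal(text_emb.shape)
    return noisy / np.linalg.norm(noisy, axis=-1, keepdims=True)


def decoder_step(weights, features, labels, lr=0.5):
    """One softmax cross-entropy gradient step for a toy linear decoder."""
    logits = features @ weights
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    n = len(labels)
    loss = -np.log(probs[np.arange(n), labels]).mean()
    grad = probs.copy()
    grad[np.arange(n), labels] -= 1.0  # d(loss)/d(logits), up to the 1/n factor
    new_weights = weights - lr * (features.T @ grad) / n
    return new_weights, loss


# Placeholder data: random unit vectors stand in for CLIP text embeddings,
# and random class ids stand in for character labels.
rng = np.random.default_rng(42)
num_samples, dim, num_classes = 32, 16, 4
text_emb = rng.standard_normal((num_samples, dim))
text_emb /= np.linalg.norm(text_emb, axis=-1, keepdims=True)
labels = rng.integers(0, num_classes, size=num_samples)

# Treat the perturbed text embeddings as pseudo visual features and
# "pre-train" the toy decoder on them, with no images involved.
pseudo_visual = perturb_embeddings(text_emb, noise_scale=0.1, seed=1)
weights = np.zeros((dim, num_classes))
losses = []
for _ in range(50):
    weights, loss = decoder_step(weights, pseudo_visual, labels)
    losses.append(loss)
```

In a real setup the placeholder embeddings would come from an actual CLIP text encoder and the linear layer would be replaced by an STR decoder; the sketch only shows the data flow of pre-training a decoder on text-derived features.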

Code & Models

Repositories

topdu/openocr (PyTorch, official)

Models

topdu/OpenOCR (Hugging Face model, 5 likes)

Taxonomy

Topics: Handwritten Text Recognition Techniques

Methods: Contrastive Language-Image Pre-training · ALIGN