Decoder Pre-Training with only Text for Scene Text Recognition
Shuai Zhao, Yongkun Du, Zhineng Chen, Yu-Gang Jiang

TL;DR
This paper introduces DPTR, a scene text recognition (STR) pre-training method that pre-trains the decoder on text alone, using CLIP text embeddings in place of image features rather than relying on synthetic images.
Contribution
DPTR uses CLIP text embeddings as pseudo visual features and introduces an Offline Randomized Perturbation (ORP) strategy and a Feature Merge Unit (FMU), enabling effective decoder pre-training from text alone for STR tasks.
Findings
DPTR outperforms existing methods on multiple benchmarks.
The approach reduces reliance on synthetic datasets.
It demonstrates broad applicability across various decoders.
Abstract
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets. However, the domain gap between synthetic and real images poses a challenge in acquiring feature representations that align well with images on real scenes, thereby limiting the performance of these methods. We note that vision-language models like CLIP, pre-trained on extensive real image-text pairs, effectively align images and text in a unified embedding space, suggesting the potential to derive the representations of real images from text alone. Building upon this premise, we introduce a novel method named Decoder Pre-training with only text for STR (DPTR). DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder. An Offline Randomized Perturbation (ORP) strategy is introduced. It…
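The core idea in the abstract, feeding text embeddings to the decoder where image features would normally go, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: `toy_text_encoder` is a fixed random lookup table standing in for a frozen CLIP text encoder, `ToyDecoder` is a hypothetical single-layer cross-attention decoder, and the sketch only computes the loss on one sample. It does not implement ORP or the authors' actual architecture.

```python
import numpy as np

CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"
DIM = 32
rng = np.random.default_rng(0)

# ASSUMPTION: a fixed random table stands in for a frozen CLIP text encoder.
# It is NOT CLIP; it only mimics the interface "text in, embeddings out".
_FROZEN_TABLE = rng.normal(size=(len(CHARSET), DIM))


def toy_text_encoder(text):
    """Return one embedding per character, shape (len(text), DIM)."""
    ids = np.array([CHARSET.index(c) for c in text])
    return _FROZEN_TABLE[ids]


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ToyDecoder:
    """Hypothetical single-layer cross-attention decoder (illustration only)."""

    def __init__(self, max_len=25):
        # One learned query per output position, plus a linear classifier.
        self.queries = rng.normal(scale=0.1, size=(max_len, DIM))
        self.w_out = rng.normal(scale=0.1, size=(DIM, len(CHARSET)))

    def forward(self, memory, length):
        q = self.queries[:length]                      # (L, DIM)
        attn = softmax(q @ memory.T / np.sqrt(DIM))    # (L, T) attention weights
        context = attn @ memory                        # (L, DIM)
        return context @ self.w_out                    # (L, |CHARSET|) logits


def pretrain_loss(decoder, text):
    """Cross-entropy of the decoder when text embeddings replace image features."""
    memory = toy_text_encoder(text)  # pseudo visual embeddings derived from text
    logits = decoder.forward(memory, len(text))
    probs = softmax(logits)
    targets = np.array([CHARSET.index(c) for c in text])
    return float(-np.log(probs[np.arange(len(text)), targets]).mean())


loss = pretrain_loss(ToyDecoder(), "street42")
print(f"cross-entropy on one text-only sample: {loss:.3f}")
```

In a real pre-training loop this loss would be minimized over a text corpus with the text encoder kept frozen, so that only the decoder learns; no image is needed at any point in the sketch.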
Taxonomy
Topics: Handwritten Text Recognition Techniques
Methods: Contrastive Language-Image Pre-training · ALIGN
