TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model

Jiahao Lyu; Jin Wei; Gangyan Zeng; Zeng Li; Enze Xie; Wei Wang; Yu Zhou

arXiv:2403.10047·cs.CV·March 18, 2024·1 cites

Open Access

TL;DR

This paper introduces TextBlockV2, a scene text spotting method inspired by the glimpse-focus way humans read text in a scene. It pairs coarse block-level detection with a pre-trained language model (PLM) so that recognition accuracy does not depend on precise detection.

Contribution

The paper proposes combining coarse block-level detection with PLM-based recognition, which reduces the need for fine-grained detection and improves performance on complex scene text.

Findings

01. Outperforms existing methods on multiple benchmarks.

02. Handles complex scenarios such as occluded and multi-line text effectively.

03. Demonstrates the potential of LLMs for spotting text directly from entire images.

Abstract

Existing scene text spotters are designed to locate and transcribe texts from images. However, it is challenging for a spotter to achieve precise detection and recognition of scene texts simultaneously. Inspired by the glimpse-focus spotting pipeline of human beings and impressive performances of Pre-trained Language Models (PLMs) on visual tasks, we ask: 1) "Can machines spot texts without precise detection just like human beings?", and if yes, 2) "Is text block another alternative for scene text spotting other than word or character?" To this end, our proposed scene text spotter leverages advanced PLMs to enhance performance without fine-grained detection. Specifically, we first use a simple detector for block-level text detection to obtain rough positional information. Then, we finetune a PLM using a large-scale OCR dataset to achieve accurate recognition. Benefiting from the…
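The two-stage flow described in the abstract (rough block-level detection, then recognition of each block) can be sketched in a few lines. This is only an illustrative outline, not the authors' implementation: the names `crop`, `spot_text`, `detect_blocks`, and `recognize_block` are hypothetical, the image is a toy list of pixel rows, and the detector and recognizer are passed in as plain callables standing in for the paper's block detector and fine-tuned PLM.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), half-open pixel coordinates
Image = Sequence[Sequence[int]]   # toy grayscale image: a sequence of pixel rows


def crop(image: Image, box: Box) -> List[List[int]]:
    """Cut a block out of the image, clamping the box to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    x0, y0, x1, y1 = box
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(width, x1), min(height, y1)
    return [list(row[x0:x1]) for row in image[y0:y1]]


def spot_text(
    image: Image,
    detect_blocks: Callable[[Image], List[Box]],       # hypothetical coarse detector
    recognize_block: Callable[[List[List[int]]], str],  # hypothetical block recognizer
) -> List[Tuple[Box, str]]:
    """Stage 1: get rough text blocks. Stage 2: transcribe each cropped block."""
    results = []
    for box in detect_blocks(image):
        block = crop(image, box)
        if not block or not block[0]:
            continue  # skip boxes that fall entirely outside the image
        results.append((box, recognize_block(block)))
    return results
```

Because the boxes are only expected to be rough, the sketch clamps them to the image instead of rejecting them; any real tolerance to imprecise boxes would have to come from the recognizer itself.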

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques