TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model
Jiahao Lyu, Jin Wei, Gangyan Zeng, Zeng Li, Enze Xie, Wei Wang, Yu Zhou

TL;DR
This paper introduces TextBlockV2, a scene text spotting method that leverages pre-trained language models and block-level detection to improve recognition accuracy without relying on precise detection, inspired by human visual focus.
Contribution
It proposes a novel approach combining coarse block detection with PLM-based recognition, reducing the need for fine-grained detection and enhancing performance on complex scene texts.
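The two-stage idea described here (coarse block detection, then recognition of each block) can be sketched as a minimal pipeline. This is only an illustration under stated assumptions, not the authors' implementation: `spot_text`, `toy_detector`, and `toy_recognizer` are hypothetical names, the image is a toy 2D list, and the recognizer is a placeholder standing in for a fine-tuned PLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


@dataclass
class SpottedBlock:
    box: Box   # rough block-level location
    text: str  # transcription of everything inside the block


def crop(image: Image, box: Box) -> Image:
    """Cut the rectangular block out of the image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def spot_text(
    image: Image,
    detect_blocks: Callable[[Image], List[Box]],
    recognize_block: Callable[[Image], str],
) -> List[SpottedBlock]:
    """Stage 1: coarse block boxes. Stage 2: recognize each cropped block."""
    return [
        SpottedBlock(box, recognize_block(crop(image, box)))
        for box in detect_blocks(image)
    ]


# Hypothetical stand-ins so the sketch runs end to end.
def toy_detector(image: Image) -> List[Box]:
    # Assumption: pretend the image holds two text blocks, top and bottom half.
    h, w = len(image), len(image[0])
    return [(0, 0, w, h // 2), (0, h // 2, w, h)]


def toy_recognizer(block: Image) -> str:
    # Placeholder for a PLM-based recognizer; it only reports the block size.
    return f"<{len(block)}x{len(block[0])} block>"


image = [[0] * 8 for _ in range(4)]
blocks = spot_text(image, toy_detector, toy_recognizer)
```

Passing the detector and recognizer as callables keeps the sketch independent of any particular model: the block boxes only need to be roughly right, because the recognizer receives the whole cropped block rather than a tightly fitted word or character region.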
Findings
Outperforms existing methods on multiple benchmarks
Handles complex scenarios like occlusion and multi-line texts effectively
Demonstrates potential of LLMs for scene text spotting from entire images
Abstract
Existing scene text spotters are designed to locate and transcribe texts from images. However, it is challenging for a spotter to achieve precise detection and recognition of scene texts simultaneously. Inspired by the glimpse-focus spotting pipeline of human beings and impressive performances of Pre-trained Language Models (PLMs) on visual tasks, we ask: 1) "Can machines spot texts without precise detection just like human beings?", and if yes, 2) "Is text block another alternative for scene text spotting other than word or character?" To this end, our proposed scene text spotter leverages advanced PLMs to enhance performance without fine-grained detection. Specifically, we first use a simple detector for block-level text detection to obtain rough positional information. Then, we finetune a PLM using a large-scale OCR dataset to achieve accurate recognition. Benefiting from the…
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques
