Hear the Scene: Audio-Enhanced Text Spotting
Jing Li, Bo Wang

TL;DR
This paper presents a text spotting method trained only on transcription annotations, with no location labels. It lowers annotation cost, supports audio-based labeling, and reports competitive accuracy.
Contribution
It introduces a query-based, weakly supervised text spotting framework that trains on transcription annotations alone, together with a circular curriculum learning strategy and support for audio annotation.
Findings
Achieves competitive performance without location annotations
Reduces annotation effort by using transcription and audio labels
Employs a novel coarse-to-fine cross-attention localization mechanism
Abstract
Recent advancements in scene text spotting have focused on end-to-end methodologies that heavily rely on precise location annotations, which are often costly and labor-intensive to procure. In this study, we introduce an innovative approach that leverages only transcription annotations for training text spotting models, substantially reducing the dependency on elaborate annotation processes. Our methodology employs a query-based paradigm that facilitates the learning of implicit location features through the interaction between text queries and image embeddings. These features are later refined during the text recognition phase using an attention activation map. Addressing the challenges associated with training a weakly-supervised model from scratch, we implement a circular curriculum learning strategy to enhance model convergence. Additionally, we introduce a coarse-to-fine…
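As an illustration of the query-based idea described in the abstract, the following is a minimal sketch of how a single text query might attend over image embeddings and how a coarse location could be read off the resulting attention map. It is not the authors' implementation: the function names (`attention_map`, `coarse_location`), the scaled dot-product scoring, and the attention-weighted centroid are all assumptions chosen for clarity, and the paper's refinement and curriculum steps are not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(query, image_embeddings):
    """Hypothetical helper: scaled dot-product attention of one query
    (list of d floats) over an H x W grid of d-dimensional embeddings.
    Returns an H x W grid of weights that sum to 1."""
    h, w = len(image_embeddings), len(image_embeddings[0])
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, image_embeddings[r][c])) / math.sqrt(d)
        for r in range(h)
        for c in range(w)
    ]
    weights = softmax(scores)
    return [weights[r * w:(r + 1) * w] for r in range(h)]


def coarse_location(attn):
    """Assumed readout: attention-weighted centroid (row, col) of the map."""
    row = sum(r * wt for r, line in enumerate(attn) for wt in line)
    col = sum(c * wt for line in attn for c, wt in enumerate(line))
    return row, col


# Toy example: one grid cell carries a feature that matches the query.
grid = [[[0.0, 0.0] for _ in range(6)] for _ in range(4)]
grid[2][4] = [8.0, 0.0]
attn = attention_map([8.0, 0.0], grid)
print(coarse_location(attn))
```

In this toy setup the attention mass concentrates on the matching cell, so the centroid falls near row 2, column 4. No location label is used anywhere; the position emerges only from the query-to-embedding similarity, which is the intuition behind learning implicit location features from transcriptions.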
Taxonomy
Topics: Digital Humanities and Scholarship
Methods: Softmax · Attention Is All You Need
