Hear the Scene: Audio-Enhanced Text Spotting

Jing Li; Bo Wang

arXiv:2412.19504 · cs.CV · April 23, 2025

Open Access

TL;DR

This paper presents a text spotting method that is trained solely on transcription annotations, with no location labels. This reduces annotation cost, supports audio-based labeling, and still achieves competitive accuracy.

Contribution

The paper introduces a query-based, weakly supervised text spotting framework that is trained only on transcription annotations, together with a circular curriculum learning strategy and support for audio annotation.

Findings

01 Achieves competitive performance without location annotations

02 Reduces annotation effort by using transcription and audio labels

03 Employs a novel coarse-to-fine cross-attention localization mechanism

Abstract

Recent advancements in scene text spotting have focused on end-to-end methodologies that heavily rely on precise location annotations, which are often costly and labor-intensive to procure. In this study, we introduce an innovative approach that leverages only transcription annotations for training text spotting models, substantially reducing the dependency on elaborate annotation processes. Our methodology employs a query-based paradigm that facilitates the learning of implicit location features through the interaction between text queries and image embeddings. These features are later refined during the text recognition phase using an attention activation map. Addressing the challenges associated with training a weakly-supervised model from scratch, we implement a circular curriculum learning strategy to enhance model convergence. Additionally, we introduce a coarse-to-fine…
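
The abstract's idea that location can emerge from the interaction between text queries and image embeddings can be illustrated with a small cross-attention example. The sketch below is not the authors' implementation: it assumes plain scaled dot-product attention with a softmax, a single query, and a toy 2 x 3 grid of image embeddings, and the function names (`cross_attention_map`, `coarse_location`) are hypothetical. It only shows how an attention activation map over image positions can act as a coarse location cue without any box or polygon supervision.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_map(query, image_embeddings, height, width):
    """Attend one text query over a flattened height x width grid of
    image embeddings and return the weights as a 2D activation map.

    Assumption: standard scaled dot-product attention; the paper's
    actual attention design may differ.
    """
    d = len(query)
    assert len(image_embeddings) == height * width
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in image_embeddings
    ]
    weights = softmax(scores)
    # Reshape the flat weight vector back into rows of the image grid.
    return [weights[r * width:(r + 1) * width] for r in range(height)]


def coarse_location(attn_map):
    """Return (row, col) of the strongest activation in the map."""
    cells = (
        (r, c)
        for r in range(len(attn_map))
        for c in range(len(attn_map[0]))
    )
    return max(cells, key=lambda rc: attn_map[rc[0]][rc[1]])


# Toy data (hypothetical): a 2 x 3 grid of 2-d embeddings, row-major.
# The last cell, grid position (1, 2), aligns most strongly with the query.
grid = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [0.0, 0.0], [0.1, 0.1], [2.0, 2.0],
]
query = [1.0, 1.0]

attn = cross_attention_map(query, grid, height=2, width=3)
peak = coarse_location(attn)
print(peak)  # (1, 2)
```

In this toy setting the peak of the map marks the cell whose embedding best matches the query. A real model would learn the queries and embeddings from transcription-only supervision, and the abstract states that the implicit location features are refined further during recognition.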

Taxonomy

Topics: Digital Humanities and Scholarship

Methods: Softmax · Attention Is All You Need