Looking and Listening: Audio Guided Text Recognition

Wenwen Yu; Mingyu Liu; Biao Yang; Enming Zhang; Deqiang Jiang; Xing Sun; Yuliang Liu; Xiang Bai

arXiv:2306.03482 · cs.CV · June 7, 2023 · 1 citation

TL;DR

This paper introduces AudioOCR, a probabilistic audio decoder that guides scene text recognition during training, improving accuracy especially in challenging scenarios without adding inference costs.

Contribution

The authors propose AudioOCR, a training-only audio guidance method that enhances scene text recognition across various benchmarks and scenarios.

Findings

01. Consistent performance improvements on 12 benchmarks.

02. Effective in recognizing non-English and out-of-vocabulary words.

03. No additional inference cost introduced.
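
The "training-only" property behind the last finding can be illustrated with a toy sketch. It is not the authors' implementation: the class `RecognizerWithAudioGuidance`, the helper names, the `audio_weight` value, and the mean-squared-error stand-in for the audio loss are all assumptions made for illustration. The point is only that an auxiliary mel-spectrogram branch can contribute to the training objective while being skipped entirely at inference.

```python
class RecognizerWithAudioGuidance:
    """Toy stand-in (hypothetical): the text head always runs,
    the audio head runs only while training."""

    def __init__(self, text_head, audio_head):
        self.text_head = text_head
        self.audio_head = audio_head
        self.audio_head_calls = 0  # counts how often the audio branch executed

    def forward(self, features, training):
        outputs = {"text": self.text_head(features)}
        if training:
            # Auxiliary branch: predict a mel-spectrogram-like sequence.
            self.audio_head_calls += 1
            outputs["mel"] = self.audio_head(features)
        return outputs


def mean_squared_error(predicted, target):
    """Placeholder audio loss (assumption; the paper's loss may differ)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def training_objective(text_loss, mel_predicted, mel_target, audio_weight=0.5):
    """Recognition loss plus a weighted auxiliary audio loss."""
    return text_loss + audio_weight * mean_squared_error(mel_predicted, mel_target)


# Usage with toy heads operating on plain lists of floats.
model = RecognizerWithAudioGuidance(
    text_head=lambda f: [x * 2 for x in f],
    audio_head=lambda f: [x + 1 for x in f],
)
inference_out = model.forward([1.0, 2.0], training=False)  # no "mel" key
training_out = model.forward([1.0, 2.0], training=True)    # includes "mel"
loss = training_objective(1.0, training_out["mel"], [2.0, 5.0])
```

Because the audio branch is only reached when `training` is true, dropping it after training leaves the inference path unchanged in this sketch.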

Abstract

Text recognition in the wild is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest that vision and language processing are effective for scene text recognition. Yet correcting edit errors such as additions, deletions, or replacements is still the main challenge for existing approaches. In fact, the content of a text and its audio naturally correspond to each other, i.e., a single character error may result in a clearly different pronunciation. In this paper, we propose AudioOCR, a simple yet effective probabilistic audio decoder for mel spectrogram sequence prediction to guide scene text recognition, which participates only in the training phase and brings no extra cost during the inference stage. The underlying principle of AudioOCR can be easily applied to existing approaches. Experiments using 7 previous scene text recognition methods…
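
The edit errors named in the abstract (add, delete, replace) are the three operations counted by the standard Levenshtein edit distance. The following is a minimal sketch of that general metric, not code from the paper; the function name `edit_distance` and the example strings are chosen here for illustration.

```python
def edit_distance(predicted: str, target: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning predicted into target."""
    # previous[j] holds the distance between the processed prefix of
    # `predicted` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, p_char in enumerate(predicted, start=1):
        current = [i]  # turning i characters into an empty string costs i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (p_char != t_char)
            insertion = current[j - 1] + 1   # add a character the prediction missed
            deletion = previous[j] + 1       # drop a spurious predicted character
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("cofee", "coffee"))    # 1: one missing character
print(edit_distance("coffeee", "coffee"))  # 1: one extra character
print(edit_distance("coffce", "coffee"))   # 1: one replaced character
```

Each of the three misreadings is a single edit away from the target word, which is the kind of small textual error the abstract argues can still produce a clearly different pronunciation.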

Code & Models

Repositories

wenwenyu/audioocr (official)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Music and Audio Processing · Speech Recognition and Synthesis