Looking and Listening: Audio Guided Text Recognition
Wenwen Yu, Mingyu Liu, Biao Yang, Enming Zhang, Deqiang Jiang, Xing Sun, Yuliang Liu, Xiang Bai

TL;DR
This paper introduces AudioOCR, a probabilistic audio decoder that guides scene text recognition during training, improving accuracy especially in challenging scenarios without adding inference costs.
Contribution
The authors propose AudioOCR, a training-only audio guidance method that enhances scene text recognition across various benchmarks and scenarios.
Findings
Consistent performance improvements on 12 benchmarks.
Effective in recognizing non-English and out-of-vocabulary words.
No additional inference cost introduced.
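To make the "training-only, no inference cost" idea concrete, here is a minimal toy sketch of an auxiliary branch that runs only during training. It is not the authors' architecture: `ToyRecognizer`, its placeholder heads, the L1 loss, and `audio_weight` are all hypothetical names and choices made for illustration (the paper describes a probabilistic audio decoder, whose actual loss is not given on this page).

```python
from typing import List, Optional, Tuple


class ToyRecognizer:
    """Hypothetical stand-in: a shared encoder feeding a text head and an audio head."""

    def __init__(self) -> None:
        self.audio_head_calls = 0  # counts how often the auxiliary branch runs

    def encode(self, image: List[float]) -> List[float]:
        # Placeholder feature extractor (a real model would use a visual backbone).
        return [float(x) for x in image]

    def text_head(self, feats: List[float]) -> float:
        # Placeholder recognition output.
        return sum(feats)

    def audio_head(self, feats: List[float]) -> List[float]:
        # Placeholder "mel spectrogram" prediction; used for training guidance only.
        self.audio_head_calls += 1
        return [0.5 * f for f in feats]

    def forward(self, image: List[float], training: bool) -> Tuple[float, Optional[List[float]]]:
        feats = self.encode(image)
        text_out = self.text_head(feats)
        # The audio branch is skipped entirely at inference time.
        mel_out = self.audio_head(feats) if training else None
        return text_out, mel_out


def combined_loss(text_loss: float, mel_pred: List[float],
                  mel_target: List[float], audio_weight: float = 1.0) -> float:
    # Assumption: a plain L1 term stands in for whatever audio loss the paper uses.
    audio_loss = sum(abs(p - t) for p, t in zip(mel_pred, mel_target)) / len(mel_pred)
    return text_loss + audio_weight * audio_loss
```

In this sketch, calling `forward(image, training=False)` never touches `audio_head`, which is the sense in which such a branch adds nothing to inference.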
Abstract
Text recognition in the wild is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest that vision and language processing are effective for scene text recognition. Yet resolving edit errors such as add, delete, or replace is still the main challenge for existing approaches. In fact, the content of a text and its audio naturally correspond to each other, i.e., a single character error may result in a clearly different pronunciation. In this paper, we propose AudioOCR, a simple yet effective probabilistic audio decoder for mel spectrogram sequence prediction to guide scene text recognition, which participates only in the training phase and brings no extra cost during the inference stage. The underlying principle of AudioOCR can be easily applied to existing approaches. Experiments using 7 previous scene text recognition methods…
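The edit errors mentioned in the abstract (adding, deleting, or replacing a character) are the three operations counted by the standard Levenshtein edit distance. The sketch below is a generic implementation of that distance, included only to illustrate the error types; it is not taken from the paper, and the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(pred: str, target: str) -> int:
    """Levenshtein distance: minimum single-character edits turning pred into target."""
    # prev[j] holds the distance between the processed prefix of pred and target[:j].
    prev = list(range(len(target) + 1))
    for i, pc in enumerate(pred, start=1):
        curr = [i]  # turning pred[:i] into the empty string takes i deletions
        for j, tc in enumerate(target, start=1):
            cost = 0 if pc == tc else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from pred
                curr[j - 1] + 1,     # add a character to pred
                prev[j - 1] + cost,  # replace a character (free if they match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("horse", "house"))   # 1 (one replaced character)
print(edit_distance("hous", "house"))    # 1 (one missing character)
print(edit_distance("housee", "house"))  # 1 (one extra character)
```

Each of the three predictions is a single edit away from "house", which is the kind of one-character error the abstract argues would produce a clearly different pronunciation.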
Taxonomy
Topics: Handwritten Text Recognition Techniques · Music and Audio Processing · Speech Recognition and Synthesis
