Vision-Language Adaptive Mutual Decoder for OOV-STR
Jinshui Hu, Chenyu Liu, Qiandong Yan, Xuyang Zhu, Jiajia Wu, Jun Du, Lirong Dai

TL;DR
This paper introduces VLAMD, a novel framework combining vision and language models with bidirectional training and mutual decoding to improve out-of-vocabulary scene text recognition, achieving state-of-the-art results.
Contribution
The paper proposes a new Vision-Language Adaptive Mutual Decoder (VLAMD) that effectively handles OOV scene text recognition by integrating visual and language models with mutual decoding.
Findings
Achieved 70.31% word accuracy on IV+OOV setting.
Achieved 59.61% word accuracy on OOV setting.
Secured 1st place in ECCV 2022 OOV-ST Challenge.
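The two accuracy figures above differ only in which test samples are counted. A minimal sketch of that distinction follows, assuming exact string match and a known training vocabulary. The function names, the toy data, and the case-sensitive comparison are illustrative assumptions, not the challenge's official evaluation protocol.

```python
def word_accuracy(predictions, targets):
    """Fraction of words whose predicted string matches the target exactly."""
    if not targets:
        return 0.0
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


def split_accuracy(predictions, targets, train_vocab):
    """Word accuracy on all samples (IV+OOV) and on the OOV subset only.

    A target counts as OOV when it does not appear in train_vocab
    (hypothetical definition used for this sketch).
    """
    oov_pairs = [(p, t) for p, t in zip(predictions, targets) if t not in train_vocab]
    oov_preds = [p for p, _ in oov_pairs]
    oov_targets = [t for _, t in oov_pairs]
    return {
        "iv_oov": word_accuracy(predictions, targets),
        "oov": word_accuracy(oov_preds, oov_targets),
    }


# Toy data: two in-vocabulary words and two unseen words.
train_vocab = {"hello", "world"}
targets = ["hello", "world", "zyxq", "kopt"]
predictions = ["hello", "world", "zyxq", "kapt"]
scores = split_accuracy(predictions, targets, train_vocab)
```

In this toy case the model gets three of four words right overall but only one of the two unseen words, so the OOV score is lower than the IV+OOV score.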
Abstract
Recent works have shown huge success of deep learning models for common in-vocabulary (IV) scene text recognition. However, in real-world scenarios, out-of-vocabulary (OOV) words are of great importance, and SOTA recognition models usually perform poorly in OOV settings. Inspired by the intuition that the learned language prior has limited OOV performance, we design a framework named Vision Language Adaptive Mutual Decoder (VLAMD) to partly tackle OOV problems. VLAMD consists of three main components. Firstly, we build an attention-based LSTM decoder with two adaptively merged visual-only modules, yielding a vision-language balanced main branch. Secondly, we add an auxiliary query-based autoregressive transformer decoding head for common visual and language prior representation learning. Finally, we couple these two designs with bidirectional training for more diverse language modeling,…
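The abstract does not say how the visual-only modules are merged with the language-aware decoder. One common way to balance two prediction sources is a learned gate, sketched below purely as an assumption: a scalar sigmoid gate blends a language-aware character distribution with a visual-only one. The function `adaptive_merge`, its arguments, and the scalar gate are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_merge(lang_logits, visual_logits, gate_logit):
    """Blend two character distributions with a scalar gate in (0, 1).

    A gate near 1 trusts the language-aware branch; a gate near 0 trusts
    the visual-only branch. In a trained model the gate would be predicted
    per decoding step; here it is passed in directly (assumption).
    """
    gate = sigmoid(gate_logit)
    p_lang = softmax(lang_logits)
    p_vis = softmax(visual_logits)
    return [gate * a + (1.0 - gate) * b for a, b in zip(p_lang, p_vis)]


# The language branch prefers character 0, the visual branch prefers character 2.
# A low gate value lets the visual evidence win, as one would want for an OOV word.
merged = adaptive_merge([2.0, 0.5, 0.1], [0.1, 0.2, 3.0], gate_logit=-2.0)
```

Because the result is a convex combination of two probability distributions, it is itself a valid distribution over characters.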
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
