Context Perception Parallel Decoder for Scene Text Recognition
Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Chenxia Li, Yuning Du, Yu-Gang Jiang

TL;DR
This paper introduces the Context Perception Parallel Decoder (CPPD) for scene text recognition, which combines the speed of parallel decoding with improved accuracy by modeling linguistic and visual context.
Contribution
The paper proposes a novel CPPD model that enhances parallel decoding in scene text recognition by integrating context perception modules, achieving high accuracy and fast inference.
Findings
CPPD achieves comparable accuracy to autoregressive models.
CPPD runs approximately 8 times faster than AR-based models.
Plugging the character counting and character ordering modules into existing decoders improves their accuracy.
Abstract
Scene text recognition (STR) methods have struggled to attain high accuracy and fast inference speed. Autoregressive (AR)-based models implement the recognition in a character-by-character manner, showing superiority in accuracy but with slow inference speed. Alternatively, parallel decoding (PD)-based models infer all characters in a single decoding pass, offering faster inference speed but generally worse accuracy. We first present an empirical study of AR decoding in STR, and discover that the AR decoder not only models linguistic context, but also provides guidance on visual context perception. Consequently, we propose Context Perception Parallel Decoder (CPPD) to predict the character sequence in a PD pass. CPPD devises a character counting module to infer the occurrence count of each character, and a character ordering module to deduce the content-free reading order and…
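The character counting task described in the abstract can be illustrated with a minimal sketch of how a per-character occurrence target might be built from a ground-truth label. This is not the authors' implementation: the character set `CHARSET` and the helper `count_targets` are hypothetical names chosen for illustration, and the paper's actual vocabulary and target encoding are not given in this summary.

```python
from collections import Counter
from typing import List

# Hypothetical character set, assumed for illustration only;
# the vocabulary used in the paper is not stated in this summary.
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"


def count_targets(label: str, charset: str = CHARSET) -> List[int]:
    """Return one occurrence count per character in `charset`.

    Characters in `label` that are not in `charset` are ignored.
    """
    counts = Counter(label)
    return [counts.get(ch, 0) for ch in charset]


if __name__ == "__main__":
    target = count_targets("hello")
    # Print only the characters that actually occur in the label.
    print({ch: n for ch, n in zip(CHARSET, target) if n > 0})
    # Anagrams share the same count vector, so counts carry no order.
    print(count_targets("listen") == count_targets("silent"))
```

Because the count vector is the same for any reordering of a word's characters, counting alone cannot recover the text, which is consistent with the abstract pairing it with a separate character ordering module.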
Taxonomy
Topics: Handwritten Text Recognition Techniques · Text and Document Classification Technologies · Natural Language Processing Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
