SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition
Yongkun Du, Zhineng Chen, Hongtao Xie, Caiyan Jia, Yu-Gang Jiang

TL;DR
SVTRv2 is a CTC-based scene text recognition model that surpasses most encoder-decoder models in accuracy by handling text irregularities and modeling linguistic context, while keeping inference fast.
Contribution
The paper introduces SVTRv2, a CTC-based scene text recognition model. It adds a multi-size resizing strategy and a feature rearrangement module to handle text irregularities, and it models linguistic context, improving accuracy without increasing inference time.
Findings
SVTRv2 surpasses most encoder-decoder models in accuracy.
SVTRv2 maintains fast inference speed.
The model performs well across various text irregularities and languages.
Abstract
Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly because their simple architecture, which contains only a visual model and a CTC-aligned linear classifier, gives fast inference. However, they generally exhibit worse accuracy than encoder-decoder-based methods (EDTRs) because they struggle with text irregularity and lack linguistic context. To address these challenges, we propose SVTRv2, a CTC model endowed with the ability to handle text irregularities and model linguistic context. First, a multi-size resizing strategy is proposed to resize text instances to appropriate predefined sizes, effectively avoiding severe text distortion. Meanwhile, we introduce a feature rearrangement module to ensure that visual features accommodate the requirement of CTC, thus alleviating the alignment puzzle.…
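To make the "CTC-aligned linear classifier" in the abstract concrete, here is a minimal sketch of standard greedy CTC decoding, the step that turns per-frame classifier scores into a string. It is not taken from the SVTRv2 code: the function name ctc_greedy_decode, the choice of index 0 for the blank symbol, and the toy character set are all assumptions made for illustration.

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, charset, blank=0):
    """Greedy CTC decoding (illustrative sketch, not the authors' code).

    frame_scores: list of per-frame score lists, one score per class.
    charset: list mapping class index -> character; the entry at the
             blank index is never emitted (assumed layout).
    blank: index of the CTC blank symbol (assumed to be 0 here).
    """
    # 1. Pick the highest-scoring class independently for every frame.
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    # 2. Collapse consecutive repeats of the same class.
    collapsed = [index for index, _ in groupby(best)]
    # 3. Drop blanks and map the remaining indices to characters.
    return "".join(charset[index] for index in collapsed if index != blank)


# Toy example: a hypothetical 5-class charset with the blank at index 0.
charset = ["", "e", "h", "l", "o"]
path = [2, 2, 0, 1, 3, 0, 3, 4]  # h h _ e l _ l o
scores = [[1.0 if j == i else 0.0 for j in range(len(charset))] for i in path]
print(ctc_greedy_decode(scores, charset))  # -> hello
```

Each frame is decoded independently with no autoregressive loop, which is why CTC inference is fast. The blank between the two "l" frames is what lets a doubled letter survive the repeat-collapsing step.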
Taxonomy
Topics: Speech Recognition and Synthesis · Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction
