SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

Yongkun Du; Zhineng Chen; Hongtao Xie; Caiyan Jia; Yu-Gang Jiang

arXiv:2411.15858·cs.CV·July 16, 2025

TL;DR

SVTRv2 is a scene text recognition model that outperforms encoder-decoder models by effectively handling text irregularities and leveraging linguistic context, while maintaining fast inference speed.

Contribution

The paper introduces SVTRv2, a CTC-based scene text recognition model with novel modules for text irregularity handling and linguistic context integration, improving accuracy without increasing inference time.

Findings

01. SVTRv2 surpasses most encoder-decoder models in accuracy.

02. SVTRv2 maintains fast inference speed.

03. The model performs well across various text irregularities and languages.

Abstract

Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly because of their simple architecture, which contains only a visual model and a CTC-aligned linear classifier, and their resulting fast inference. However, they generally exhibit worse accuracy than encoder-decoder-based methods (EDTRs) because they struggle with text irregularity and lack linguistic context. To address these challenges, we propose SVTRv2, a CTC model endowed with the ability to handle text irregularities and model linguistic context. First, a multi-size resizing strategy is proposed to resize text instances to appropriate predefined sizes, effectively avoiding severe text distortion. Meanwhile, we introduce a feature rearrangement module to ensure that visual features accommodate the requirement of CTC, thus alleviating the alignment puzzle.…
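To illustrate why a CTC-aligned classifier gives fast inference, here is a minimal sketch of standard greedy CTC decoding: take the best class per frame, collapse consecutive repeats, then drop blanks. This is the generic textbook procedure, not code from the paper or from the topdu/openocr repository. The function name `ctc_greedy_decode`, the blank index of 0, and the toy charset are assumptions made for the example.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is reserved for the CTC blank symbol


def ctc_greedy_decode(frame_scores, charset, blank=BLANK):
    """Greedy CTC decoding over per-frame class scores.

    frame_scores: list of frames, each a list of scores (one per class).
    charset: sequence mapping class index -> character (blank slot unused).
    """
    # Step 1: pick the highest-scoring class independently for each frame.
    best_path = [
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores
    ]
    # Step 2: collapse runs of the same class into a single occurrence.
    collapsed = [index for index, _ in groupby(best_path)]
    # Step 3: remove blanks; the blank between two equal labels is what
    # allows a genuinely doubled character such as "aa" to survive step 2.
    return "".join(charset[index] for index in collapsed if index != blank)


# Toy example with a hypothetical three-class charset: blank, "a", "b".
toy_charset = ["", "a", "b"]
toy_scores = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # a (kept, separated by the blank)
    [0.1, 0.2, 0.7],    # b
    [0.1, 0.1, 0.8],    # b (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
]
decoded = ctc_greedy_decode(toy_scores, toy_charset)
```

Each frame is decoded independently in a single pass, with no autoregressive loop over output characters, which is the property that makes CTC models cheaper at inference than encoder-decoder models.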

Code & Models

Repositories

topdu/openocr (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction