TL;DR
This paper introduces ViTSTR, a vision transformer-based model for scene text recognition that achieves a strong balance of accuracy, speed, and efficiency, making it suitable for energy-constrained mobile applications.
Contribution
ViTSTR is a simple, single-stage model built on a compute-efficient vision transformer, offering competitive accuracy with significantly improved speed and reduced computational requirements.
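To make "single-stage" concrete, here is a minimal sketch of how a transformer encoder's outputs could be read directly as characters, with no separate rectification or sequence-decoding stage. It is illustrative only and not the authors' code. The alphabet, the `[GO]` and `[s]` token ids, the 224x224 input size, the 16x16 patch size, and the helper names `num_patches`, `greedy_decode`, and `one_hot` are all assumptions made for the example.

```python
# Hypothetical illustration of single-stage decoding; not the authors' code.
CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz"  # assumed alphabet
GO, EOS = 0, 1                                    # assumed special token ids
VOCAB = ["[GO]", "[s]"] + list(CHARSET)


def num_patches(height, width, patch):
    """Number of non-overlapping patch tokens the encoder would see."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    return (height // patch) * (width // patch)


def greedy_decode(logits):
    """Read characters from per-position class scores.

    logits: one row of len(VOCAB) scores per output position.
    Each position is resolved independently with an argmax, and
    reading stops at the first end-of-sequence token.
    """
    chars = []
    for row in logits:
        idx = max(range(len(row)), key=row.__getitem__)
        if idx == EOS:
            break
        if idx != GO:
            chars.append(VOCAB[idx])
    return "".join(chars)


def one_hot(idx):
    """Stand-in for encoder output: a score row that peaks at idx."""
    row = [0.0] * len(VOCAB)
    row[idx] = 1.0
    return row


# Fake encoder output spelling "stop", then end-of-sequence, then a
# trailing position that should be ignored.
scores = [one_hot(VOCAB.index(c)) for c in "stop"]
scores += [one_hot(EOS), one_hot(VOCAB.index("x"))]

print(num_patches(224, 224, 16))  # 196
print(greedy_decode(scores))      # stop
```

Because every position is decoded with an independent argmax, the whole word comes out of one forward pass. This is one plausible reading of why a single-stage design can be faster than a pipeline with a recurrent or attention-based decoder.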
Findings
Small ViTSTR achieves 82.6% accuracy (84.2% with data augmentation) at a 2.4x speed-up over TRBA, using 43.4% of the parameters and 42.2% of the FLOPS.
Tiny ViTSTR reaches 80.3% accuracy at 2.5x speed-up with minimal parameters.
Relative to the TRBA baseline (84.3% accuracy), ViTSTR gives up a small amount of accuracy in exchange for higher speed and lower compute cost.
Abstract
Scene text recognition (STR) enables computers to read text in natural scenes such as object labels, road signs and instructions. STR helps machines perform informed decisions such as what object to pick, which direction to go, and what is the next step of action. In the body of work on STR, the focus has always been on recognition accuracy. There is little emphasis placed on speed and computational efficiency which are equally important especially for energy-constrained mobile machines. In this paper we propose ViTSTR, an STR with a simple single stage model architecture built on a compute and parameter efficient vision transformer (ViT). On a comparable strong baseline method such as TRBA with accuracy of 84.3%, our small ViTSTR achieves a competitive accuracy of 82.6% (84.2% with data augmentation) at 2.4x speed up, using only 43.4% of the number of parameters and 42.2% FLOPS. The…
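The trade-off quoted in the abstract can be restated as simple arithmetic. The sketch below uses only the figures given above for TRBA and small ViTSTR. The constant names and the `summarize` helper are hypothetical, and absolute parameter or FLOP counts are not stated here, so only relative quantities are computed.

```python
# Figures quoted in the abstract (TRBA baseline vs. small ViTSTR).
TRBA_ACC = 84.3          # accuracy, percent
VITSTR_ACC = 82.6        # accuracy, percent
VITSTR_ACC_AUG = 84.2    # accuracy with data augmentation, percent
SPEEDUP = 2.4            # ViTSTR speed relative to TRBA
PARAM_FRACTION = 0.434   # ViTSTR parameters as a fraction of TRBA's
FLOPS_FRACTION = 0.422   # ViTSTR FLOPS as a fraction of TRBA's


def summarize():
    """Derive relative trade-off numbers from the quoted figures."""
    return {
        "accuracy_gap": round(TRBA_ACC - VITSTR_ACC, 1),
        "accuracy_gap_aug": round(TRBA_ACC - VITSTR_ACC_AUG, 1),
        "param_reduction_pct": round((1 - PARAM_FRACTION) * 100, 1),
        "flops_reduction_pct": round((1 - FLOPS_FRACTION) * 100, 1),
        "relative_time_per_image": round(1 / SPEEDUP, 3),
    }


for name, value in summarize().items():
    print(f"{name}: {value}")
```

Read this way, small ViTSTR gives up 1.7 accuracy points (0.1 with data augmentation) while cutting parameters by 56.6% and FLOPS by 57.8%, and it needs roughly 0.417 of the baseline's time per image.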
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Layer Normalization · Softmax · Dense Connections · Vision Transformer
