Vision Transformer for Fast and Efficient Scene Text Recognition

Rowel Atienza

arXiv:2105.08582 · cs.CV · May 19, 2021

TL;DR

This paper introduces ViTSTR, a vision transformer-based model for scene text recognition that achieves a strong balance of accuracy, speed, and efficiency, making it suitable for energy-constrained mobile applications.

Contribution

ViTSTR is a simple, single-stage model built on a compute-efficient vision transformer, offering competitive accuracy with significantly improved speed and reduced computational requirements.

Findings

01

The small ViTSTR achieves 82.6% accuracy (84.2% with data augmentation) at a 2.4x speed-up over the TRBA baseline, using 43.4% of the parameters and 42.2% of the FLOPS.

02

Tiny ViTSTR reaches 80.3% accuracy at 2.5x speed-up with minimal parameters.

03

ViTSTR outperforms baseline methods in accuracy and efficiency trade-offs.
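
The trade-off behind these findings can be made concrete with a little arithmetic. The sketch below is only an illustration: it uses the figures quoted in the abstract for the small ViTSTR and the TRBA baseline, and the function name `tradeoff` and its output keys are hypothetical, not from the paper. It also assumes that a 2.4x speed-up translates directly into 1/2.4 of the baseline inference time, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract (TRBA is the baseline).
TRBA_ACCURACY = 84.3              # percent
VITSTR_SMALL_ACCURACY = 82.6      # percent, without data augmentation
VITSTR_SMALL_AUG_ACCURACY = 84.2  # percent, with data augmentation
PARAM_RATIO = 0.434               # ViTSTR parameters / TRBA parameters
FLOPS_RATIO = 0.422               # ViTSTR FLOPS / TRBA FLOPS
SPEEDUP = 2.4                     # ViTSTR speed relative to TRBA


def tradeoff(baseline_acc, model_acc, param_ratio, flops_ratio, speedup):
    """Summarize what is given up and what is gained relative to a baseline.

    Hypothetical helper for illustration; not part of the ViTSTR code base.
    """
    return {
        # Accuracy lost, in percentage points.
        "accuracy_gap_points": round(baseline_acc - model_acc, 1),
        # Share of parameters and FLOPS saved, in percent.
        "param_reduction_pct": round((1 - param_ratio) * 100, 1),
        "flops_reduction_pct": round((1 - flops_ratio) * 100, 1),
        # Assumption: inference time scales as 1 / speed-up.
        "relative_time_pct": round(100 / speedup, 1),
    }


plain = tradeoff(TRBA_ACCURACY, VITSTR_SMALL_ACCURACY,
                 PARAM_RATIO, FLOPS_RATIO, SPEEDUP)
augmented = tradeoff(TRBA_ACCURACY, VITSTR_SMALL_AUG_ACCURACY,
                     PARAM_RATIO, FLOPS_RATIO, SPEEDUP)

print(plain)
print(augmented["accuracy_gap_points"])
```

Under these assumptions, the small ViTSTR gives up 1.7 accuracy points (0.1 with data augmentation) in exchange for roughly 57% fewer parameters and FLOPS.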

Abstract

Scene text recognition (STR) enables computers to read text in natural scenes such as object labels, road signs and instructions. STR helps machines make informed decisions such as what object to pick, which direction to go, and what the next step of action is. In the body of work on STR, the focus has always been on recognition accuracy. Little emphasis is placed on speed and computational efficiency, which are equally important, especially for energy-constrained mobile machines. In this paper we propose ViTSTR, an STR model with a simple single-stage architecture built on a compute- and parameter-efficient vision transformer (ViT). Compared with a strong baseline such as TRBA, which has an accuracy of 84.3%, our small ViTSTR achieves a competitive accuracy of 82.6% (84.2% with data augmentation) at a 2.4x speed-up, using only 43.4% of the parameters and 42.2% of the FLOPS. The…

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Layer Normalization · Softmax · Dense Connections · Vision Transformer