HTR-VT: Handwritten Text Recognition with Vision Transformer
Yuting Li, Dexiong Chen, Tinglong Tang, Xi Shen

TL;DR
This paper introduces a data-efficient Vision Transformer approach for handwritten text recognition that leverages CNN feature extraction, a span mask regularizer, and the SAM optimizer, achieving competitive results on small datasets and setting a new benchmark on the large LAM dataset.
Contribution
The paper proposes a ViT-based method that combines CNN feature extraction, span mask regularization, and the SAM optimizer to improve handwritten text recognition when labeled data is limited.
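To make the SAM part concrete, here is a minimal sketch of one Sharpness-Aware Minimization update in plain Python. It is not the authors' implementation: the function name sam_step, the toy quadratic loss, and the lr and rho values are assumptions for illustration, and plain gradient descent stands in for whatever base optimizer the paper actually uses. The idea shown is the generic two-pass SAM rule: first move to the approximate worst-case point within an L2 ball of radius rho, then update the original weights with the gradient measured there.

```python
import math


def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update on a list of floats (hypothetical helper).

    grad_fn(w) must return the gradient of the loss at w as a list.
    lr and rho are illustrative values, not settings from the paper.
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        # Zero gradient: no ascent direction, so leave the weights alone.
        return list(w)
    # Pass 1: climb to the approximate worst-case neighbour within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Pass 2: take the descent step from the ORIGINAL weights,
    # but using the gradient evaluated at the perturbed point.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def toy_grad(w):
    """Gradient of the toy loss f(w) = sum(w_i ** 2), assumed for the demo."""
    return [2.0 * wi for wi in w]


w = [3.0, -4.0]
for _ in range(50):
    w = sam_step(w, toy_grad)
final_loss = sum(wi * wi for wi in w)
```

Each SAM step costs two gradient evaluations instead of one, which is the usual price paid for steering toward flatter regions of the loss surface.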
Findings
Competitive performance on IAM and READ2016 datasets.
Establishes a new benchmark on the LAM dataset.
Effective regularization with span mask technique.
Abstract
We explore the application of Vision Transformer (ViT) for handwritten text recognition. The limited availability of labeled data in this domain makes it difficult to achieve high performance with ViT alone. Previous transformer-based models required external data or extensive pre-training on large datasets to excel. To address this limitation, we introduce a data-efficient ViT method that uses only the encoder of the standard transformer. We find that incorporating a Convolutional Neural Network (CNN) for feature extraction instead of the original patch embedding, and employing the Sharpness-Aware Minimization (SAM) optimizer so that the model converges towards flatter minima, yields notable improvements. Furthermore, the span mask technique we introduce, which masks interconnected features in the feature map, acts as an effective regularizer. Empirically, our…
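The span mask idea, masking runs of neighbouring features rather than isolated positions, can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names span_mask and apply_mask, the mask_ratio and span_len defaults, and the choice to fill masked positions with a constant are all assumptions, not details taken from the paper.

```python
import random


def span_mask(seq_len, mask_ratio=0.4, span_len=8, rng=None):
    """Return a list of booleans, True where a feature position is masked.

    Masks contiguous spans along a 1-D feature sequence. mask_ratio and
    span_len are hypothetical defaults chosen for the demo.
    """
    if rng is None:
        rng = random.Random()
    mask = [False] * seq_len
    if seq_len == 0 or mask_ratio <= 0:
        return mask
    span_len = max(1, min(span_len, seq_len))
    # Spans may overlap, so the masked fraction is at most about mask_ratio.
    num_spans = max(1, int(seq_len * mask_ratio / span_len))
    for _ in range(num_spans):
        start = rng.randrange(0, seq_len - span_len + 1)
        for i in range(start, start + span_len):
            mask[i] = True
    return mask


def apply_mask(features, mask, fill=0.0):
    """Replace each masked feature vector with a constant vector (assumed fill)."""
    return [[fill] * len(vec) if m else list(vec)
            for vec, m in zip(features, mask)]


# Usage: mask a sequence of 128 two-dimensional feature vectors.
features = [[1.0, 2.0] for _ in range(128)]
mask = span_mask(len(features), rng=random.Random(0))
masked_features = apply_mask(features, mask)
```

Because whole neighbourhoods disappear at once, the model cannot recover a masked position by copying an adjacent one, which is the intuition behind using spans as a regularizer.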
Taxonomy
Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Vision Transformer · Softmax · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · Residual Connection
