HTR-VT: Handwritten Text Recognition with Vision Transformer

Yuting Li; Dexiong Chen; Tinglong Tang; Xi Shen

arXiv:2409.08573 · cs.CV · September 16, 2024

TL;DR

This paper introduces a data-efficient Vision Transformer approach for handwritten text recognition that leverages CNN feature extraction, a span mask regularizer, and the SAM optimizer, achieving competitive results on small datasets and setting a new benchmark on the large LAM dataset.

Contribution

The paper proposes a novel ViT-based method with CNN features, span mask regularization, and SAM optimizer for improved handwritten text recognition on limited data.

Findings

01. Competitive performance on IAM and READ2016 datasets.
02. Establishes a new benchmark on the LAM dataset.
03. Effective regularization with span mask technique.
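
To make the span mask idea concrete, here is a minimal sketch in plain Python. It masks contiguous runs of positions in a 1D feature sequence instead of isolated positions. The function names `span_mask` and `apply_mask`, the parameters `mask_ratio` and `span_len`, their default values, and the use of a fixed `mask_token` vector are all illustrative assumptions, not details taken from the paper or its code.

```python
import random

def span_mask(seq_len, mask_ratio=0.4, span_len=8, rng=None):
    """Return a list of booleans; True marks a masked position.

    Spans of up to `span_len` adjacent positions are masked until at
    least int(seq_len * mask_ratio) positions are covered. The final
    span may overshoot the target by up to span_len - 1 positions.
    All parameter values here are hypothetical.
    """
    rng = rng or random.Random()
    mask = [False] * seq_len
    target = int(seq_len * mask_ratio)
    if target <= 0 or span_len <= 0:
        return mask
    while sum(mask) < target:
        # Start each span at a still-unmasked position so every
        # iteration masks at least one new position.
        free = [i for i, m in enumerate(mask) if not m]
        start = rng.choice(free)
        for i in range(start, min(start + span_len, seq_len)):
            mask[i] = True
    return mask

def apply_mask(features, mask, mask_token):
    """Replace masked feature vectors with a copy of `mask_token`."""
    return [list(mask_token) if m else f for f, m in zip(features, mask)]

# Usage: a toy sequence of 128 feature vectors with 4 channels each.
features = [[float(i)] * 4 for i in range(128)]
mask = span_mask(len(features), rng=random.Random(0))
masked = apply_mask(features, mask, mask_token=[0.0] * 4)
```

In a real model the mask token would more likely be a learned vector and the mask would be resampled for every training batch; both are assumptions here.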

Abstract

We explore the application of Vision Transformer (ViT) for handwritten text recognition. The limited availability of labeled data in this domain makes it difficult to achieve high performance relying solely on ViT. Previous transformer-based models required external data or extensive pre-training on large datasets to excel. To address this limitation, we introduce a data-efficient ViT method that uses only the encoder of the standard transformer. We find that incorporating a Convolutional Neural Network (CNN) for feature extraction instead of the original patch embedding, and employing the Sharpness-Aware Minimization (SAM) optimizer so that the model converges towards flatter minima, yield notable enhancements. Furthermore, our introduction of the span mask technique, which masks interconnected features in the feature map, acts as an effective regularizer. Empirically, our…
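
As a rough illustration of how Sharpness-Aware Minimization seeks flatter minima, the sketch below implements one generic SAM update in plain Python: take the gradient at the current weights, step a distance `rho` along the normalized gradient to a nearby "worst-case" point, then update the original weights with the gradient measured there. The names `sam_step`, `toy_loss`, `toy_grad`, the values of `lr` and `rho`, and the quadratic toy loss are all assumptions made for the example; this is not the paper's training code.

```python
import math

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One generic SAM update on a list of weights (illustrative only)."""
    g = grad_fn(w)
    # Small constant avoids division by zero when the gradient vanishes.
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    # Ascent step: move to a nearby point where the loss is higher.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient from the perturbed point
    # to the original weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

def toy_loss(w):
    """Hypothetical stand-in loss: sum of squares."""
    return sum(wi * wi for wi in w)

def toy_grad(w):
    """Gradient of toy_loss."""
    return [2.0 * wi for wi in w]

# Usage: run a few SAM steps on the toy problem.
w = [1.0, -2.0]
for _ in range(50):
    w = sam_step(w, toy_grad)
```

Each SAM step needs two gradient evaluations, so in practice it costs roughly twice as much per update as a plain gradient step.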

Code & Models

Datasets

felixleungsc/paperswithcode-data-evaluation-tables
dataset · 203 dl

Taxonomy

Methods

Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Vision Transformer · Softmax · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · Residual Connection