Self-supervised Pre-training of Text Recognizers

Martin Kišš; Michal Hradiš

arXiv:2405.00420·cs.CV·May 2, 2024

TL;DR

This paper explores self-supervised pre-training methods for document text recognition, demonstrating their effectiveness on historical datasets and highlighting their potential as a foundation for future research in the field.

Contribution

The paper introduces novel self-supervised pre-training techniques for text recognition, evaluates their performance, and compares their benefits and limitations with those of transfer learning.

Findings

01. Self-supervised pre-training improves recognition performance on target-domain data.
02. Pre-training on target-domain data is more effective than transfer learning from related domains.
03. Joint-embedding approaches with VICReg and NT-Xent are promising but face challenges such as model collapse.
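As a rough illustration of the VICReg objective mentioned above, the sketch below computes its three terms for two batches of embeddings (one per view): an invariance term (mean squared error between views), a variance term that penalizes any embedding dimension whose standard deviation falls below a target gamma, and a covariance term that penalizes off-diagonal covariance. It follows the generic VICReg formulation in plain Python; the weights lam = mu = 25 and nu = 1, gamma = 1, and eps = 1e-4 are assumed defaults, not values taken from this paper, and how the paper forms views from text-line images is not reproduced. The variance term is what pushes back against collapse: identical embeddings for every sample are penalized heavily.

```python
import math


def _center(z):
    """Subtract the per-dimension mean from a batch of embeddings (list of rows)."""
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in z]


def _variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge on the per-dimension std: penalize dimensions with std below gamma."""
    n, d = len(z), len(z[0])
    zc = _center(z)
    total = 0.0
    for j in range(d):
        var = sum(zc[i][j] ** 2 for i in range(n)) / (n - 1)  # unbiased variance
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d


def _covariance_term(z):
    """Sum of squared off-diagonal covariance entries, divided by the dimension d."""
    n, d = len(z), len(z[0])
    zc = _center(z)
    total = 0.0
    for a in range(d):
        for b in range(d):
            if a != b:
                cov = sum(zc[i][a] * zc[i][b] for i in range(n)) / (n - 1)
                total += cov ** 2
    return total / d


def vicreg_loss(z_a, z_b, lam=25.0, mu=25.0, nu=1.0):
    """Weighted sum of invariance, variance and covariance terms for two views."""
    n, d = len(z_a), len(z_a[0])
    invariance = sum(
        (z_a[i][j] - z_b[i][j]) ** 2 for i in range(n) for j in range(d)
    ) / (n * d)
    variance = _variance_term(z_a) + _variance_term(z_b)
    covariance = _covariance_term(z_a) + _covariance_term(z_b)
    return lam * invariance + mu * variance + nu * covariance
```

For example, a collapsed batch in which every row is identical scores about 25 · 1.98 = 49.5, while a well-spread batch compared with itself scores 0.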

Abstract

In this paper, we investigate self-supervised pre-training methods for document text recognition. Large unlabeled datasets can now be collected for many research tasks, including text recognition, but annotating them is costly. This motivates research into methods that exploit unlabeled data. We study self-supervised pre-training methods based on masked label prediction using three different approaches: Feature Quantization, VQ-VAE, and Post-Quantized AE. We also investigate joint-embedding approaches with VICReg and NT-Xent objectives, for which we propose an image shifting technique to prevent model collapse, a failure mode in which the model relies solely on positional encoding and completely ignores the input image. We perform our experiments on historical handwritten (Bentham) and historical printed datasets mainly to investigate the benefits of the self-supervised pre-training techniques with…
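The general idea behind masked label prediction with feature quantization can be sketched as follows. Each frame (horizontal position) of a text-line feature sequence is mapped to the index of its nearest codeword, which becomes a discrete pseudo-label. Some frames are masked, and the model is trained with cross-entropy to predict the pseudo-labels of the masked frames only. This is a minimal sketch under stated assumptions, not the authors' implementation: the codebook is assumed to be given (for example from k-means), masking is assumed to be independent per frame, and the names `quantize`, `sample_mask`, and `masked_cross_entropy` are hypothetical. The VQ-VAE and Post-Quantized AE variants would obtain the codebook differently.

```python
import math
import random


def quantize(frames, codebook):
    """Map each frame feature vector to the index of its nearest codeword (squared L2)."""
    labels = []
    for f in frames:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        labels.append(min(range(len(codebook)), key=dists.__getitem__))
    return labels


def sample_mask(num_frames, mask_prob=0.5, seed=0):
    """Hypothetical per-frame masking: True marks a frame whose input is hidden."""
    rng = random.Random(seed)
    return [rng.random() < mask_prob for _ in range(num_frames)]


def masked_cross_entropy(logits, targets, mask):
    """Average cross-entropy over masked frames only; unmasked frames are ignored."""
    total, count = 0.0, 0
    for row, t, m in zip(logits, targets, mask):
        if not m:
            continue
        mx = max(row)  # log-sum-exp with max subtraction for numerical stability
        log_z = mx + math.log(sum(math.exp(v - mx) for v in row))
        total += log_z - row[t]
        count += 1
    return total / count if count else 0.0
```

In a training step, `quantize` would be applied to the unmasked features to produce targets, the encoder would see a mask embedding at the masked frames, and `masked_cross_entropy` would score its per-frame logits.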

Code & Models

Repositories

dcgm/pero-pretraining
PyTorch · Official

Taxonomy

Topics: Handwritten Text Recognition Techniques

Methods: Normalized Temperature-scaled Cross Entropy Loss (NT-Xent) · Autoencoders · VQ-VAE
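For reference, the Normalized Temperature-scaled Cross Entropy (NT-Xent) loss listed under Methods can be sketched in plain Python. Given N positive pairs (two views of the same sample), each of the 2N embeddings treats its partner as the positive and the remaining 2N − 2 embeddings as negatives. Similarities are cosine similarities divided by a temperature, and the loss is the cross-entropy of picking the positive, averaged over all 2N anchors. This follows the common SimCLR-style formulation; the temperature of 0.5 is an assumed value, and the way the paper builds positive pairs from text-line images is not modeled here.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def nt_xent(z_a, z_b, temperature=0.5):
    """NT-Xent over N positive pairs (z_a[i], z_b[i]); all other views are negatives."""
    z = list(z_a) + list(z_b)
    n = len(z_a)
    total = 0.0
    for i in range(2 * n):
        pos = i + n if i < n else i - n  # index of the other view of the same sample
        denom = sum(
            math.exp(_cosine(z[i], z[k]) / temperature)
            for k in range(2 * n)
            if k != i
        )
        # -log(exp(s_pos / t) / denom), written in log space
        total += math.log(denom) - _cosine(z[i], z[pos]) / temperature
    return total / (2 * n)
```

With matching views `[[1, 0], [0, 1]]` on both sides and temperature 0.5, every anchor's loss equals log(1 + 2e^−2). Pairing each view with the wrong partner raises the loss to log(2 + e^2), which is the behavior the contrastive objective relies on.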