Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway

Tita Enstad; Trond Trosterud; Marie Iversdatter Røsok; Yngvil Beyer; Marie Roald

arXiv:2501.07300·cs.CL·January 14, 2025

TL;DR

This study fine-tunes and evaluates OCR methods for Sámi texts from the National Library of Norway, demonstrating that tailored approaches improve accuracy, with Transkribus and TrOCR outperforming Tesseract in-domain.

Contribution

It introduces a comparative evaluation and fine-tuning of OCR models specifically for Sámi languages, highlighting strategies that remain effective when manual annotations are limited.

Findings

01

Transkribus and TrOCR outperform Tesseract on Sámi texts.

02

Fine-tuning and synthetic data improve OCR accuracy with limited manual annotations.

03

Tesseract achieves the best performance on an out-of-domain dataset.

Abstract

Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text…
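The excerpt above does not say which metric the authors use to measure OCR accuracy. As an illustration only, a common choice for comparing OCR output against a manual transcription is the character error rate (CER): the edit distance between the two strings divided by the length of the reference. The sketch below assumes that metric; the function names `levenshtein` and `cer` and the sample strings are hypothetical and not taken from the paper. Both strings are normalised to Unicode NFC first, so that a precomposed "á" and an "a" followed by a combining accent count as the same character, which matters for Sámi text.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of an OCR hypothesis against a reference transcription."""
    # Assumption: NFC normalisation is applied to both sides before comparison.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical example: the OCR output drops the accent on "á".
    print(cer("Sámi giella", "Sami giella"))
```

In this hypothetical example a single substituted character in an 11-character reference gives a CER of 1/11. A lower CER means the OCR output is closer to the manual transcription.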

Taxonomy

Methods: Dense Connections · Residual Connection · Softmax · Linear Layer · Attention Is All You Need · Multi-Head Attention · Position-Wise Feed-Forward Layer · Layer Normalization · Lib · TrOCR