TL;DR
This study fine-tunes and evaluates OCR methods for Sámi texts from the National Library of Norway, showing that tailored approaches improve accuracy and that Transkribus and TrOCR outperform Tesseract in-domain.
Contribution
It presents a comparative evaluation and fine-tuning of three OCR approaches for Sámi languages, and identifies strategies that work when manual annotations are limited.
Findings
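Transkribus and TrOCR outperform Tesseract on Sámi texts from the library's collection.
Fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text improve OCR accuracy when manual annotations are limited.
Tesseract performs better on an out-of-domain dataset.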
Abstract
Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text…
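The abstract compares systems by OCR accuracy, but this excerpt does not say which metric is used. As an illustration only, the sketch below computes character error rate (CER), a common OCR metric, assumed here rather than confirmed by the source. The function names are hypothetical. Both strings are normalised to Unicode NFC first, so that a precomposed "á" and an "a" followed by a combining accent count as the same character, which matters for Sámi orthographies.

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn hyp into ref (standard dynamic-programming edit distance)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # character missing from hyp
                            curr[j - 1] + 1,      # extra character in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    # Assumption: NFC normalisation is applied before comparison.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# One wrong character ("a" instead of "á") in an 11-character reference.
print(cer("Sámi giella", "Sami giella"))
```

A lower value means a better transcription. Because the distance is divided by the reference length, the score can exceed 1.0 when the hypothesis is much longer than the reference.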
Code & Models
- 🤗 Sprakbanken/trocr_smi_nor (model) · 3 downloads
- 🤗 Sprakbanken/trocr_smi (model) · 3 downloads
- 🤗 Sprakbanken/trocr_smi_nor_pred (model) · 3 downloads
- 🤗 Sprakbanken/trocr_smi_synth (model) · 2 downloads
- 🤗 Sprakbanken/trocr_smi_pred (model) · 3 downloads
- 🤗 Sprakbanken/trocr_smi_nor_pred_synth (model) · 101 downloads
- 🤗 Sprakbanken/trocr_smi_pred_synth (model) · 16 downloads
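One of the checkpoints listed above might be used for inference roughly as follows. This is a minimal sketch, not the authors' pipeline. It assumes the repository id is `Sprakbanken/trocr_smi_nor_pred_synth`, that the repository ships a processor configuration alongside the weights, and that the input is an image already cropped to a single text line, which is what TrOCR models generally expect. The helper `transcribe_line` and the file name in the usage note are hypothetical.

```python
# Assumed repository id, read off the model list above.
MODEL_ID = "Sprakbanken/trocr_smi_nor_pred_synth"


def transcribe_line(image_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one cropped text-line image with a TrOCR checkpoint."""
    # Imported lazily so this module loads without the heavy dependencies.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    # Assumption: the repository provides both processor and model files.
    processor = TrOCRProcessor.from_pretrained(model_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)

    # TrOCR expects RGB input; the processor resizes and normalises it.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressive decoding, then conversion of token ids back to text.
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Calling `transcribe_line("line_0001.png")` would download the checkpoint on first use and return the predicted text for that line. Full pages need to be segmented into lines by a separate step first.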
Taxonomy
Methods: Dense Connections · Residual Connection · Softmax · Linear Layer · Attention Is All You Need · Multi-Head Attention · Position-Wise Feed-Forward Layer · Layer Normalization · Lib · TrOCR
