Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach
Koren Lazar, Benny Saret, Asaf Yehudai, Wayne Horowitz, Nathan, Wasserman, Gabriel Stanovsky

TL;DR
This paper introduces masked language models tailored for ancient Akkadian texts to automate the restoration of missing parts in cuneiform documents, aiding scholars in deciphering deteriorated artifacts.
Contribution
It develops specialized language models for Akkadian, demonstrating effective text completion despite limited data and validating usefulness through human expert evaluations.
Findings
Achieved 89% hit@5 accuracy in missing token prediction
Pretraining on multilingual and historical data improves performance
Models assist experts in transcribing ancient texts
Abstract
We present models which complete missing text given transliterations of ancient Mesopotamian documents, originally written on cuneiform clay tablets (2500 BCE - 100 CE). Due to the tablets' deterioration, scholars often rely on contextual cues to manually fill in missing parts in the text in a subjective and time-consuming process. We identify that this challenge can be formulated as a masked language modelling task, used mostly as a pretraining objective for contextualized language models. Following, we develop several architectures focusing on the Akkadian language, the lingua franca of the time. We find that despite data scarcity (1M tokens) we can achieve state of the art performance on missing tokens prediction (89% hit@5) using a greedy decoding scheme and pretraining on data from other languages and different time periods. Finally, we conduct human evaluations showing the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling · Handwritten Text Recognition Techniques
