SumTablets: A Transliteration Dataset of Sumerian Tablets
Cole Simmons, Richard Diehl Martinez, Dan Jurafsky

TL;DR
SumTablets is a dataset that pairs Unicode representations of Sumerian cuneiform glyphs with their published transliterations, making Sumerian transliteration accessible to modern NLP methods such as transformer models.
Contribution
The paper introduces SumTablets, the first large-scale dataset linking Sumerian cuneiform glyphs with transliterations, facilitating NLP research in Assyriology.
Findings
Achieved an average chrF (character n-gram F-score) of 97.55 with a transformer-based transliteration model.
Provided open access to a dataset of 91,606 tablets totaling 6,970,407 glyphs.
Demonstrated the potential for NLP methods to assist Sumerian transliteration tasks.
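The chrF metric cited in the findings can be illustrated with a minimal sketch. This is a simplified version, not the paper's evaluation script: it assumes character n-grams of order 1 to 6, strips whitespace before counting, averages precision and recall across n-gram orders, and combines them with beta = 2. The function names `char_ngrams` and `chrf` and the sample strings are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n, ignoring whitespace (an assumption)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale for a single hypothesis/reference pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # Skip orders longer than either string.
            continue
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    # Hypothetical transliteration strings, for illustration only.
    print(chrf("lugal-e", "lugal-e"))
    print(chrf("lugal", "lugal-e"))
```

An exact match scores 100 and a string sharing no characters with the reference scores 0; partial matches fall in between. Established implementations (for example sacreBLEU) differ in details such as whitespace handling and corpus-level aggregation, so scores from this sketch are not directly comparable to reported numbers.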
Abstract
Sumerian transliteration is a conventional system for representing a scholar's interpretation of a tablet in the Latin script. Thanks to visionary digital Assyriology projects such as ETCSL, CDLI, and Oracc, a large number of Sumerian transliterations have been published online, and these data are well-structured for a variety of search and analysis tasks. However, the absence of a comprehensive, accessible dataset pairing transliterations with a digital representation of the tablet's cuneiform glyphs has prevented the application of modern Natural Language Processing (NLP) methods to the task of Sumerian transliteration. To address this gap, we present SumTablets, a dataset pairing Unicode representations of 91,606 Sumerian cuneiform tablets (totaling 6,970,407 glyphs) with the associated transliterations published by Oracc. We construct SumTablets by first preprocessing and…
Taxonomy
Topics: Image Processing and 3D Reconstruction · Ancient Near East History · Natural Language Processing Techniques
