Curation of a Palaeohispanic Dataset for Machine Learning

Gonzalo Martínez-Fernández; Jose F Quesada; Agustín Riscos-Núñez; Francisco José Salguero-Lamillar

arXiv:2604.13070·cs.CL·April 16, 2026


TL;DR

This paper presents a structured Palaeohispanic language dataset to facilitate machine learning research, addressing the scarcity and format issues of existing resources.

Contribution

The creation of a curated, machine-learning-ready dataset for Palaeohispanic languages, enabling computational analysis in a field with limited resources.

Findings

01

Dataset enables new computational analyses of Palaeohispanic languages

02

Structured data improves accessibility for machine learning applications

03

Supports further linguistic and archaeological research

Abstract

Palaeohispanic languages are those spoken in the Iberian Peninsula before the arrival of the Romans in the 3rd century BC. Their study was set in motion after Gómez Moreno deciphered the Iberian Levantine script, one of the several semi-syllabaries used by these languages. Still, the Palaeohispanic languages have varying degrees of decipherment, and none is fully understood to this day. Most studies have been performed from a purely linguistic point of view, and a computational approach may benefit this research area greatly. However, the resources are limited and presented in a format unsuitable for techniques such as Machine Learning. Therefore, a structured dataset is constructed, which will hopefully allow more progress in the field.
