Two CFG Nahuatl for automatic corpora expansion
Juan-José Guzmán-Landa, Juan-Manuel Torres-Moreno, Miguel Figueroa-Saavedra, Ligia Quintana-Torres, Graham Ranger, Martha-Lorena Avendaño-Garrido

TL;DR
This paper introduces two CFG-based methods to generate artificial Nawatl sentences, significantly expanding the corpus to improve embedding learning and semantic similarity evaluation for this low-resource language.
Contribution
The paper presents two novel CFGs for Nawatl, enabling corpus expansion and improved embedding learning in a low-resource language context.
Findings
Expanded corpus improves embedding quality
Artificial data enhances semantic similarity task performance
Economic embeddings outperform some large language models
Abstract
The aim of this article is to introduce two Context-Free Grammars (CFGs) for Nawatl corpora expansion. Nawatl is an Amerindian language (it is a National Language of Mexico) of the π-language type, i.e. a language with few digital resources. For this reason, the corpora available for training Large Language Models (LLMs) are virtually non-existent, which poses a significant challenge. The goal is to produce a substantial number of syntactically valid artificial Nawatl sentences and thereby to expand the corpora for the purpose of learning non-contextual embeddings. To this end, we introduce two new Nawatl CFGs and use them in generative mode. Using these grammars, it is possible to expand the Nawatl corpus significantly, then use it to learn embeddings and evaluate their relevance in a sentence semantic similarity task. The results show an improvement compared…
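To illustrate what using a CFG "in generative mode" means, here is a minimal Python sketch that enumerates every sentence derivable from a small, non-recursive grammar. The grammar, its symbol names, and the placeholder terminals (det1, noun1, verb1, ...) are hypothetical and are not the paper's Nawatl grammars, which this summary does not reproduce; the functions expand and generate_corpus are likewise illustrative names.

```python
import itertools

# Toy grammar with placeholder terminals. This is an assumption made for
# illustration only; it is NOT one of the paper's two Nawatl CFGs.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"], ["N"]],
    "VP": [["V"], ["V", "NP"]],
    "DET": [["det1"]],
    "N": [["noun1"], ["noun2"]],
    "V": [["verb1"], ["verb2"]],
}


def expand(symbol, grammar):
    """Yield every terminal sequence derivable from `symbol`.

    Assumes the grammar is non-recursive, so the language is finite.
    Any symbol without a rule is treated as a terminal.
    """
    if symbol not in grammar:
        yield [symbol]
        return
    for production in grammar[symbol]:
        # Expand each right-hand-side symbol, then combine the alternatives.
        parts = [list(expand(s, grammar)) for s in production]
        for combo in itertools.product(*parts):
            yield [token for part in combo for token in part]


def generate_corpus(grammar, start="S"):
    """Return all sentences of the grammar as space-joined strings."""
    return [" ".join(tokens) for tokens in expand(start, grammar)]


sentences = generate_corpus(GRAMMAR)
print(len(sentences))
print(sentences[0])
```

Even this six-rule toy grammar yields 40 distinct sentences, which hints at how a richer grammar with a real lexicon could enlarge a small corpus considerably. A recursive grammar would need a depth limit or random sampling instead of exhaustive enumeration.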
Taxonomy
Topics: Natural Language Processing Techniques · Language and cultural evolution · Topic Modeling
