Corpora deduplication or duplication in Natural Language Processing of few resourced languages ? A case of study: The Mexico's Nahuatl

Juan-José Guzman-Landa; Juan-Manuel Torres-Moreno; Graham Ranger; Miguel Figueroa-Saavedra; Martha-Lorena Avendaño-Garrido; Elvys Linhares-Pontes; Luis-Gil Moreno-Jiménez

arXiv:2604.07015·cs.CL·April 9, 2026

TL;DR

This study investigates whether controlled data duplication can improve NLP performance for low-resource languages, specifically Nahuatl, by expanding corpora and training embeddings evaluated on semantic similarity tasks.

Contribution

It introduces and evaluates a novel incremental duplication technique for corpus expansion in low-resource languages, showing moderate performance gains.

Findings

1. Incremental duplication yields moderate improvement in semantic similarity tasks.
2. Controlled duplication can be beneficial for low-resource language NLP.
3. This is the first known application of this technique in the literature.

Abstract

In this article, we seek to answer the following question: could data duplication be useful in Natural Language Processing (NLP) for languages with limited computational resources? For languages of this type (or $π$-languages), corpora available for training Large Language Models are virtually non-existent. In particular, we study the impact of corpus expansion in Nawatl, an agglutinative and polysynthetic $π$-language spoken by over 2 million people, with a large number of dialectal varieties. The aim is to expand the new $π$-yalli corpus, which contains a limited number of Nawatl texts, by duplicating it in a controlled way with the incremental duplication technique, in order to learn embeddings that are well suited to NLP tasks. To that end, static embeddings were trained and evaluated in a sentence-level semantic similarity task. Our results show a…
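The abstract does not say how sentence-level similarity is computed from static embeddings. A minimal sketch of one common approach, assuming each sentence is represented by the mean of its word vectors and pairs are compared by cosine similarity, might look like this. The function names and the toy two-dimensional vectors are hypothetical and are not taken from the paper.

```python
import math

def sentence_vector(tokens, embeddings):
    """Average the vectors of tokens present in `embeddings`; None if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity(sent_a, sent_b, embeddings):
    """Whitespace-tokenize both sentences and compare their mean vectors."""
    va = sentence_vector(sent_a.split(), embeddings)
    vb = sentence_vector(sent_b.split(), embeddings)
    if va is None or vb is None:
        return 0.0  # assumption: out-of-vocabulary sentences score zero
    return cosine(va, vb)

# Toy vectors invented for illustration only; real static embeddings
# would be trained on the corpus and have far more dimensions.
toy_embeddings = {
    "atl": [1.0, 0.0],
    "tepetl": [0.0, 1.0],
    "kalli": [1.0, 1.0],
}

score = similarity("atl tepetl", "kalli", toy_embeddings)
```

In an evaluation of this kind, such scores are typically compared against human similarity judgments with a rank correlation. Whether the paper follows this exact protocol is not stated in the text above.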
