Corpora deduplication or duplication in Natural Language Processing of few resourced languages ? A case of study: The Mexico's Nahuatl
Juan-José Guzman-Landa, Juan-Manuel Torres-Moreno, Graham Ranger, Miguel Figueroa-Saavedra, Martha-Lorena Avendaño-Garrido, Elvys Linhares-Pontes, Luis-Gil Moreno-Jiménez

TL;DR
This study investigates whether controlled data duplication can improve NLP performance for low-resource languages, specifically Nahuatl, by expanding corpora and training embeddings evaluated on semantic similarity tasks.
Contribution
It introduces and evaluates a novel incremental duplication technique for corpus expansion in low-resource languages, showing moderate performance gains.
Findings
Incremental duplication yields moderate improvement in semantic similarity tasks.
Controlled duplication can be beneficial for low-resource language NLP.
This is the first known application of this technique in the literature.
Abstract
In this article, we seek to answer the following question: could data duplication be useful in Natural Language Processing (NLP) for languages with limited computational resources? For languages of this type (or π-languages), corpora available for training Large Language Models are virtually non-existent. In particular, we study the impact of corpus expansion in Nawatl, an agglutinative and polysynthetic π-language spoken by over 2 million people, with a large number of dialectal varieties. The aim is to expand the new π-yalli corpus, which contains a limited number of Nawatl texts, by duplicating it in a controlled way. In our experiments, we use the incremental duplication technique. The goal is to learn embeddings that are well-suited to NLP tasks. To this end, static embeddings were trained and evaluated on a sentence-level semantic similarity task. Our results show a…
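The abstract does not spell out how incremental duplication is carried out, so the following is only a minimal sketch under stated assumptions: the corpus is taken to be a list of sentences, "incremental" is taken to mean repeating the whole corpus 1, 2, ..., k times, and sentence-level similarity is taken to be the cosine between averaged static word vectors. The function names (`incremental_duplication`, `sentence_vector`, `cosine`) and the toy tokens are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, Iterator, List, Tuple


def incremental_duplication(
    sentences: List[str], max_factor: int
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (factor, corpus) pairs; corpus is the original repeated `factor` times.

    Assumption: whole-corpus repetition. The paper's scheme may differ.
    """
    if max_factor < 1:
        raise ValueError("max_factor must be >= 1")
    for factor in range(1, max_factor + 1):
        yield factor, sentences * factor


def sentence_vector(sentence: str, embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the static vectors of the known tokens in a sentence."""
    tokens = [t for t in sentence.lower().split() if t in embeddings]
    if not tokens:
        return []  # no known token: no vector
    dim = len(embeddings[tokens[0]])
    return [sum(embeddings[t][i] for t in tokens) / len(tokens) for i in range(dim)]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 for empty or zero-norm vectors."""
    if not u or not v:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Placeholder sentences, not real Nawatl data.
    toy_corpus = ["tok1 tok2", "tok3"]
    for factor, corpus in incremental_duplication(toy_corpus, 3):
        # In a real experiment, an embedding model would be trained on `corpus`
        # here and then scored on the similarity task.
        print(factor, len(corpus))
```

In such a setup, each expanded corpus would be used to train a separate set of embeddings, and the similarity scores would be compared across duplication factors to see where, if anywhere, gains level off.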
