Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?

Arij Riabi; Benoît Sagot; Djamé Seddah

arXiv:2110.13658 · cs.CL · June 4, 2025 · 1 cites

Open Access

TL;DR

This paper demonstrates that character-based language models trained on limited, noisy, low-resource language data can achieve competitive performance on downstream NLP tasks, offering a promising approach for underrepresented languages.

Contribution

It introduces a character-based language model approach for low-resource, high-variability languages and shows its effectiveness compared to larger pre-trained models.

Findings

01

Character-based models perform well with limited data.

02

Models trained on 99k sentences achieve near state-of-the-art results.

03

Effective on both North-African dialectal Arabic and noisy French data.

Abstract

Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language…

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling