Automatic Correction of Writing Anomalies in Hausa Texts

Ahmad Mustapha Wali; Sergiu Nisioi

arXiv:2506.03820·cs.CL·May 5, 2026

TL;DR

This paper finetunes transformer-based models to automatically correct writing anomalies in Hausa texts, builds a parallel dataset of over 400,000 noisy-clean sentence pairs, and reports improved NLP task performance after correction.

Contribution

The paper introduces a large-scale parallel dataset of noisy and clean Hausa sentences and compares several multilingual and African language models on the error correction task.

Findings

01 M2M100 achieves state-of-the-art correction results.

02 Error correction significantly improves downstream NLP tasks.

03 Synthetic noise generation effectively mimics real writing errors.

Abstract

Hausa texts are often characterized by writing anomalies, such as incorrect character substitutions and spacing errors, which sometimes hinder natural language processing (NLP) applications. This paper presents an approach to automatically correct anomalies by finetuning transformer-based models. Using a corpus gathered from several public sources, we create a large-scale parallel dataset of over 400,000 noisy-clean Hausa sentence pairs by introducing synthetically generated noise to mimic realistic writing errors. In addition, we finetune several multilingual and African language models, including M2M100, AfriTeVA, NCAIR1/N-ATLaS, UBC-NLP/cheetah-base, and other variants of BART and T5 for this correction task. Our experimental results demonstrate that models such as M2M100 achieve state-of-the-art results despite their smaller size and distinct pretraining, and that correcting errors…
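
The abstract describes building noisy-clean pairs by injecting synthetic character substitutions and spacing errors into clean sentences. The Python sketch below illustrates that general idea only. The substitution map (hooked Hausa letters replaced by plain ASCII look-alikes), the probabilities, and the helper names `add_noise` and `make_pairs` are assumptions made for illustration; the paper's actual noise rules are not given in this summary.

```python
import random

# Assumed substitution map: hooked Hausa letters written as plain ASCII letters.
# Illustrative only; not taken from the paper.
SUBSTITUTIONS = {"ɓ": "b", "ɗ": "d", "ƙ": "k", "ƴ": "y",
                 "Ɓ": "B", "Ɗ": "D", "Ƙ": "K"}


def add_noise(sentence, rng, p_sub=0.5, p_space=0.1):
    """Return a noisy copy of `sentence` (hypothetical helper).

    p_sub   -- chance of replacing each hooked letter with its ASCII look-alike
    p_space -- chance of deleting each space, which merges two words
    """
    out = []
    for ch in sentence:
        if ch in SUBSTITUTIONS and rng.random() < p_sub:
            out.append(SUBSTITUTIONS[ch])  # character substitution error
        elif ch == " " and rng.random() < p_space:
            continue  # spacing error: drop the space
        else:
            out.append(ch)
    return "".join(out)


def make_pairs(clean_sentences, seed=0, **noise_kwargs):
    """Build (noisy, clean) pairs; the seed keeps the output reproducible."""
    rng = random.Random(seed)
    return [(add_noise(s, rng, **noise_kwargs), s) for s in clean_sentences]


if __name__ == "__main__":
    for noisy, clean in make_pairs(["ƙofa ɗaya ce", "ɓera ya gudu"]):
        print(f"{noisy!r} <- {clean!r}")
```

Pairs produced this way could serve as source (noisy) and target (clean) inputs when finetuning a sequence-to-sequence model; a realistic pipeline would likely cover more error types than the two shown here.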
