Automatic Correction of Writing Anomalies in Hausa Texts
Ahmad Mustapha Wali, Sergiu Nisioi

TL;DR
This paper finetunes transformer-based models to automatically correct writing anomalies in Hausa texts, builds a parallel dataset of over 400,000 noisy-clean sentence pairs, and shows that correction improves downstream NLP task performance.
Contribution
The paper introduces a large-scale parallel dataset of over 400,000 noisy-clean Hausa sentence pairs and compares several multilingual and African language models on the error correction task.
Findings
M2M100 achieves state-of-the-art correction results.
Error correction significantly improves downstream NLP tasks.
Synthetic noise generation effectively mimics real writing errors.
Abstract
Hausa texts are often characterized by writing anomalies, such as incorrect character substitutions and spacing errors, which sometimes hinder natural language processing (NLP) applications. This paper presents an approach to automatically correct anomalies by finetuning transformer-based models. Using a corpus gathered from several public sources, we create a large-scale parallel dataset of over 400,000 noisy-clean Hausa sentence pairs by introducing synthetically generated noise to mimic realistic writing errors. In addition, we finetune several multilingual and African language models, including M2M100, AfriTeVA, NCAIR1/N-ATLaS, UBC-NLP/cheetah-base, and other variants of BART and T5 for this correction task. Our experimental results demonstrate that models such as M2M100 achieve state-of-the-art results despite their smaller size and distinct pretraining, and that correcting errors…
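The noisy-clean pair construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the substitution table (hooked Hausa letters typed as plain ASCII), the probabilities, and the names `HOOKED_TO_PLAIN`, `add_noise`, and `make_pairs` are all assumptions chosen for illustration. It covers only the two anomaly types the abstract names, character substitutions and spacing errors.

```python
import random

# Assumed substitution table (not taken from the paper): hooked Hausa
# letters replaced by their plain ASCII look-alikes.
HOOKED_TO_PLAIN = {
    "ɓ": "b", "ɗ": "d", "ƙ": "k", "ƴ": "y",
    "Ɓ": "B", "Ɗ": "D", "Ƙ": "K", "Ƴ": "Y",
}


def add_noise(sentence, sub_prob=0.5, space_prob=0.1, rng=None):
    """Return a noisy copy of a clean sentence.

    sub_prob:   chance of replacing each hooked letter (hypothetical value).
    space_prob: chance of dropping each space, merging two words
                (hypothetical value).
    """
    if rng is None:
        rng = random.Random()
    out = []
    for ch in sentence:
        if ch in HOOKED_TO_PLAIN and rng.random() < sub_prob:
            out.append(HOOKED_TO_PLAIN[ch])  # character substitution
        elif ch == " " and rng.random() < space_prob:
            continue  # spacing error: drop the space
        else:
            out.append(ch)
    return "".join(out)


def make_pairs(clean_sentences, seed=0, **kwargs):
    """Build (noisy, clean) pairs with a seeded generator for repeatability."""
    rng = random.Random(seed)
    return [(add_noise(s, rng=rng, **kwargs), s) for s in clean_sentences]


if __name__ == "__main__":
    for noisy, clean in make_pairs(["ɗan ƙasa"], sub_prob=1.0, space_prob=0.0):
        print(noisy, "<-", clean)
```

Each pair puts the noisy sentence first as the model input and the clean sentence second as the target, which is the usual layout for sequence-to-sequence finetuning. A realistic generator would likely need more error types than the two sketched here.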
