Afro-MNIST: Synthetic generation of MNIST-style datasets for low-resource languages
Daniel J Wu, Andrew C Yang, Vinay U Prabhu

TL;DR
Afro-MNIST introduces synthetic MNIST-style datasets for four African orthographies, enabling machine learning applications in low-resource languages and providing a method for generating such datasets from minimal examples.
Contribution
The paper presents Afro-MNIST datasets for four African scripts (Ge'ez, Vai, Osmanya, and N'Ko) and an open-source method for generating synthetic MNIST-style datasets from a single example of each digit.
Findings
The datasets serve as drop-in replacements for MNIST.
The method for generating a dataset from a single example of each digit is open-sourced.
The datasets are intended to support machine learning education in underrepresented communities.
Abstract
We present Afro-MNIST, a set of synthetic MNIST-style datasets for four orthographies used in Afro-Asiatic and Niger-Congo languages: Ge'ez (Ethiopic), Vai, Osmanya, and N'Ko. These datasets serve as "drop-in" replacements for MNIST. We also describe and open-source a method for synthetic MNIST-style dataset generation from single examples of each digit. These datasets can be found at https://github.com/Daniel-Wu/AfroMNIST. We hope that MNIST-style datasets will be developed for other numeral systems, and that these datasets vitalize machine learning education in underrepresented nations in the research community.
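To make the idea of generating a dataset from one example per digit concrete, here is a minimal sketch in dependency-free Python. The abstract does not say which transformations the authors apply, so this example assumes small random rotations, scalings, and shifts. The function names (`random_variant`, `generate_dataset`), the parameter ranges, and the placeholder glyph are all hypothetical and are not taken from the AfroMNIST repository.

```python
import math
import random

SIZE = 28  # MNIST images are 28x28 grayscale with values 0..255


def bilinear(img, x, y):
    """Sample img at fractional (x, y); pixels outside the image read as 0."""
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    total = 0.0
    for dy, wy in ((0, 1 - fy), (1, fy)):
        for dx, wx in ((0, 1 - fx), (1, fx)):
            xi, yi = x0 + dx, y0 + dy
            if 0 <= xi < SIZE and 0 <= yi < SIZE:
                total += wx * wy * img[yi][xi]
    return total


def random_variant(exemplar, rng, max_rot_deg=10.0, max_scale=0.1, max_shift=2.0):
    """Return one randomly rotated, scaled, and shifted copy of a 28x28 exemplar.

    The transformation family and its ranges are assumptions for illustration.
    """
    theta = math.radians(rng.uniform(-max_rot_deg, max_rot_deg))
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    c = (SIZE - 1) / 2.0  # rotate and scale about the image center
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[0] * SIZE for _ in range(SIZE)]
    for y in range(SIZE):
        for x in range(SIZE):
            # Inverse mapping: find where each output pixel comes from.
            u, v = x - c - tx, y - c - ty
            sx = (cos_t * u + sin_t * v) / scale + c
            sy = (-sin_t * u + cos_t * v) / scale + c
            out[y][x] = min(255, max(0, round(bilinear(exemplar, sx, sy))))
    return out


def generate_dataset(exemplars, per_class, seed=0):
    """exemplars maps a digit label to one 28x28 image; returns (images, labels)."""
    rng = random.Random(seed)
    images, labels = [], []
    for digit, exemplar in sorted(exemplars.items()):
        for _ in range(per_class):
            images.append(random_variant(exemplar, rng))
            labels.append(digit)
    return images, labels


# Placeholder exemplar: a vertical bar standing in for a real digit glyph.
bar = [[255 if 12 <= x <= 15 and 4 <= y <= 23 else 0 for x in range(SIZE)]
       for y in range(SIZE)]
images, labels = generate_dataset({0: bar, 1: bar}, per_class=3)
```

Because the output keeps MNIST's 28x28 grayscale layout and integer labels, code written against MNIST should accept it once the lists are converted to whatever array type the training code expects. A real pipeline would start from one rendered or scanned glyph per digit rather than the placeholder bar.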
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
