Afro-MNIST: Synthetic generation of MNIST-style datasets for low-resource languages

Daniel J Wu; Andrew C Yang; Vinay U Prabhu

arXiv:2009.13509·cs.CV·September 29, 2020

TL;DR

Afro-MNIST introduces synthetic MNIST-style datasets for four African orthographies (Ge'ez, Vai, Osmanya, and N'Ko), along with an open-source method for generating such datasets from a single example of each digit.

Contribution

The paper presents Afro-MNIST datasets for four African scripts and a method for generating synthetic MNIST-style datasets from a single example of each digit.

Findings

01

Datasets serve as drop-in replacements for MNIST.

02

The method for generating a dataset from a single example of each digit is open-sourced.

03

Supports machine learning education in underrepresented languages.
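
To make the "drop-in replacement" finding concrete, the sketch below checks whether a pair of arrays follows the usual MNIST layout: 28x28 grayscale `uint8` images with integer labels 0 through 9. The function name `looks_like_mnist` is hypothetical, and the on-disk format of the released Afro-MNIST files is not assumed here; this only illustrates what an MNIST-compatible array would look like once loaded.

```python
import numpy as np


def looks_like_mnist(images, labels):
    """Return True if the arrays match the usual MNIST layout (hypothetical helper)."""
    images = np.asarray(images)
    labels = np.asarray(labels)
    return bool(
        images.ndim == 3                              # (N, height, width)
        and images.shape[1:] == (28, 28)              # MNIST image size
        and images.dtype == np.uint8                  # 8-bit grayscale pixels
        and labels.shape == (images.shape[0],)        # one label per image
        and np.issubdtype(labels.dtype, np.integer)   # integer class ids
        and labels.size > 0
        and labels.min() >= 0
        and labels.max() <= 9                         # ten digit classes
    )
```

A dataset that passes this check can generally be fed to code written for MNIST without changing input shapes or the number of output classes.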

Abstract

We present Afro-MNIST, a set of synthetic MNIST-style datasets for four orthographies used in Afro-Asiatic and Niger-Congo languages: Ge'ez (Ethiopic), Vai, Osmanya, and N'Ko. These datasets serve as "drop-in" replacements for MNIST. We also describe and open-source a method for synthetic MNIST-style dataset generation from single examples of each digit. These datasets can be found at https://github.com/Daniel-Wu/AfroMNIST. We hope that MNIST-style datasets will be developed for other numeral systems, and that these datasets vitalize machine learning education in underrepresented nations in the research community.
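
As a rough illustration of generating many samples from a single exemplar per digit, the sketch below warps each exemplar with a smooth random displacement field (an elastic-style deformation). This is a minimal sketch under stated assumptions, not the authors' released generator: the technique choice, the names `smooth`, `elastic_variant`, and `make_dataset`, and the parameters `alpha`, `passes`, and `width` are all hypothetical. It uses repeated box blurs in place of a Gaussian filter and nearest-neighbour sampling in place of interpolation so that it depends only on NumPy.

```python
import numpy as np


def smooth(field, passes=3, width=5):
    """Blur a 2-D field with repeated separable box filters (a rough Gaussian stand-in)."""
    kernel = np.ones(width) / width
    for _ in range(passes):
        field = np.apply_along_axis(np.convolve, 0, field, kernel, mode="same")
        field = np.apply_along_axis(np.convolve, 1, field, kernel, mode="same")
    return field


def elastic_variant(exemplar, rng, alpha=8.0):
    """Warp one exemplar image with a smooth random displacement field."""
    h, w = exemplar.shape
    # Random per-pixel offsets, smoothed so neighbouring pixels move together.
    dx = smooth(rng.uniform(-1, 1, (h, w))) * alpha
    dy = smooth(rng.uniform(-1, 1, (h, w))) * alpha
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Nearest-neighbour lookup of the displaced source pixel, clipped to the image.
    src_r = np.clip(np.rint(rows + dy), 0, h - 1).astype(int)
    src_c = np.clip(np.rint(cols + dx), 0, w - 1).astype(int)
    return exemplar[src_r, src_c]


def make_dataset(exemplars, per_class, seed=0):
    """Build (images, labels) from one exemplar image per digit class."""
    rng = np.random.default_rng(seed)
    images, labels = [], []
    for digit, exemplar in enumerate(exemplars):
        for _ in range(per_class):
            images.append(elastic_variant(exemplar, rng))
            labels.append(digit)
    return np.stack(images).astype(np.uint8), np.array(labels, dtype=np.uint8)
```

Given ten 28x28 `uint8` exemplars, `make_dataset(exemplars, per_class=6000)` would return arrays of shape `(60000, 28, 28)` and `(60000,)`, which matches the size of the MNIST training split.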

Code & Models

Repositories

Daniel-Wu/AfroMNIST
Official

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling