AFRICAPTION: Establishing a New Paradigm for Image Captioning in African Languages
Mardiyyah Oduwole, Prince Mireku, Fatimo Adebanjo, Oluwatosin Olajide, Mahi Aminu Aliyu, Jekaterina Novikova

TL;DR
This paper introduces AfriCaption, a scalable, multilingual image captioning framework for 20 African languages, including a new dataset, a dynamic pipeline, and a large vision-to-text model to promote inclusive AI.
Contribution
It presents the first comprehensive, scalable image captioning resource and model for under-represented African languages, addressing resource scarcity and inclusivity in multimodal AI.
Findings
Curated a new dataset with semantically aligned captions in 20 African languages.
Developed a dynamic, quality-preserving captioning pipeline.
Built a 0.5B parameter vision-to-text model integrating SigLIP and NLLB200.
Abstract
Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages. Our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that ensures ongoing quality through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B parameter vision-to-text architecture that integrates SigLIP and NLLB200 for caption generation across under-represented languages. This unified framework ensures ongoing data quality and establishes the first scalable image-captioning resource for under-represented African languages,…
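The abstract does not spell out how "model ensembling and adaptive substitution" is implemented, so the following is only a minimal sketch of one plausible reading: several candidate translators each produce a caption, a quality scorer ranks the candidates, and a caption is flagged for substitution when no candidate clears a threshold. Every name here (`select_translation`, `Selection`, `min_score`, the toy translators and scorer) is hypothetical and is not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical interfaces, assumed for illustration only.
Translator = Callable[[str, str], str]  # (english_caption, target_lang) -> translation
Scorer = Callable[[str, str], float]    # (source, candidate) -> quality score in [0, 1]


@dataclass
class Selection:
    translator: str  # name of the ensemble member that produced the caption
    text: str        # the selected translated caption
    score: float     # quality score assigned by the scorer


def select_translation(
    caption: str,
    lang: str,
    translators: Dict[str, Translator],
    scorer: Scorer,
    min_score: float = 0.5,
) -> Optional[Selection]:
    """Return the best-scoring candidate, or None if none reaches min_score.

    A None result signals that the caption should be handled another way
    (for example, by substituting a different model or source caption).
    """
    best: Optional[Selection] = None
    for name, translate in translators.items():
        candidate = translate(caption, lang)
        score = scorer(caption, candidate)
        if best is None or score > best.score:
            best = Selection(name, candidate, score)
    if best is None or best.score < min_score:
        return None
    return best


# Toy stand-ins so the sketch runs without any model downloads.
toy_translators: Dict[str, Translator] = {
    "model_a": lambda text, lang: "",                   # simulates a failed translation
    "model_b": lambda text, lang: f"[{lang}] {text}",   # simulates a usable translation
}


def toy_scorer(source: str, candidate: str) -> float:
    # Placeholder metric: 1.0 if the source text survives in the candidate.
    return 1.0 if source in candidate else 0.0


chosen = select_translation("a dog runs", "yor_Latn", toy_translators, toy_scorer)
rejected = select_translation(
    "a dog runs", "yor_Latn", {"model_a": toy_translators["model_a"]}, toy_scorer
)
```

In a real pipeline the toy translators would be replaced by actual translation models and the scorer by a semantic-similarity or quality-estimation metric; which ones the authors use is not stated in this summary.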
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Language, Metaphor, and Cognition
