Breaking the Barriers of Text-Hungry and Audio-Deficient AI

Hamidou Tembine; Issa Bamia; Massa NDong; Bakary Coulibaly; Oumar Issiaka Traore; Moussa Traore; Moussa Sanogo; Mamadou Eric Sangare; Salif Kante; Daryl Noupa Yongueng; Hafiz Tiomoko Ali; Malik Tiomoko; Frejus Laleye; Boualem Djehiche; Wesmanegda Elisee Dipama; Idris Baba Saje; Hammid Mohammed Ibrahim; Moumini Sanogo; Marie Coursel Nininahazwe; Abdul-Latif Siita; Haine Mhlongo; Teddy Nelvy Dieu Merci Kouka; Mariam Serine Jeridi; Mutiyamuogo Parfait Mupenge; Lekoueiry Dehah; Abdoul Aziz Bio Sidi Bouko; Wilfried Franceslas Zokoue; Odette Richette Sambila; Alina RS Mbango; Mady Diagouraga; Oumarou Moussa Sanoussi; Gizachew Dessalegn; Mohamed Lamine Samoura; Bintou Laetitia Audrey Coulibaly

arXiv:2506.02443 · cs.SD · June 4, 2025

Open Access

TL;DR

This paper introduces a novel textless, audio-to-audio AI framework that leverages new architectures and a multiscale audio-semantic transform to generate high-fidelity speech, expanding language technology access to underserved, audio-literate populations.

Contribution

It presents the first fully audio-based translation architectures and a multiscale audio-semantic transform, enabling scalable, textless speech generation across diverse languages.

Findings

01 High-fidelity speech generation without textual supervision

02 Effective processing of unwritten or low-resource languages

03 Scalable system learning directly from raw audio

Abstract

While global linguistic diversity spans more than 7,164 recognized languages, the dominant architecture of machine intelligence remains fundamentally biased toward written text. This bias excludes over 700 million people, particularly in rural and remote regions, who are audio-literate. In this work, we introduce a fully textless, audio-to-audio machine intelligence framework designed to serve this underserved population, as well as everyone who prefers the efficiency of audio. Our contributions include novel audio-to-audio translation architectures that bypass text entirely, including spectrogram-, scalogram-, wavelet-, and unit-based models. Central to our approach is the Multiscale Audio-Semantic Transform (MAST), a representation that encodes tonal, prosodic, speaker, and expressive features. We further integrate MAST into a fractional diffusion of mean-field-type framework powered by…

Taxonomy

Topics: Subtitles and Audiovisual Media · Natural Language Processing Techniques