Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset

Neil Shah; Shirish Karande; Vineet Gandhi

arXiv:2412.18839·cs.SD·January 24, 2025

Open Access

TL;DR

This paper improves Non-Audible Murmur (NAM)-to-speech conversion by learning phoneme alignments directly from NAMs and by incorporating the lip modality through a diffusion-based method. It also introduces the MultiNAM dataset for benchmarking.

Contribution

It presents novel methods for NAM-to-speech conversion, including direct phoneme alignment learning, lip modality integration, and a diffusion-based approach, along with a new comprehensive dataset.

Findings

01. Enhanced speech intelligibility and speaker generalization.

02. Effective lip-to-speech synthesis using diffusion models.

03. Benchmark results demonstrating improvements over existing methods.

Abstract

Current Non-Audible Murmur (NAM)-to-speech techniques rely on voice cloning to simulate ground-truth speech from paired whispers. However, the simulated speech often lacks intelligibility and fails to generalize well across different speakers. To address this issue, we focus on learning phoneme-level alignments from paired whispers and text and employ a Text-to-Speech (TTS) system to simulate the ground-truth. To reduce dependence on whispers, we learn phoneme alignments directly from NAMs, though the quality is constrained by the available training data. To further mitigate reliance on NAM/whisper data for ground-truth simulation, we propose incorporating the lip modality to infer speech and introduce a novel diffusion-based method that leverages recent advancements in lip-to-speech technology. Additionally, we release the MultiNAM dataset with over 7.96 hours of paired NAM, whisper,…
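As a rough illustration of what a phoneme-level alignment provides to a duration-driven TTS system, the sketch below collapses a frame-level phoneme labelling into per-phoneme durations and expands it back. It is not the authors' implementation: the function names, the toy phoneme labels, and the assumption that the alignment arrives as one label per frame are all hypothetical.

```python
from typing import List, Tuple


def durations_from_alignment(frame_phonemes: List[str]) -> List[Tuple[str, int]]:
    """Collapse one-label-per-frame alignment into (phoneme, n_frames) runs.

    Assumption: adjacent identical labels belong to the same phoneme, so two
    genuinely repeated phonemes in a row would be merged by this toy version.
    """
    runs: List[Tuple[str, int]] = []
    for phoneme in frame_phonemes:
        if runs and runs[-1][0] == phoneme:
            # Same phoneme as the previous frame: extend the current run.
            runs[-1] = (phoneme, runs[-1][1] + 1)
        else:
            # New phoneme: start a run of length one.
            runs.append((phoneme, 1))
    return runs


def expand_by_duration(runs: List[Tuple[str, int]]) -> List[str]:
    """Repeat each phoneme for its duration, giving a frame-level sequence."""
    return [phoneme for phoneme, n_frames in runs for _ in range(n_frames)]


# Hypothetical frame-level alignment for a short utterance.
frames = ["HH", "HH", "AH", "AH", "AH", "L", "OW", "OW"]
runs = durations_from_alignment(frames)
restored = expand_by_duration(runs)
```

With durations of this kind, a TTS model can be asked to produce speech whose timing follows the NAM or whisper input, which is the role the abstract assigns to the learned alignments.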


Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research

Methods: Diffusion models · Text-to-Speech (TTS) · Phoneme-level alignment