Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset
Neil Shah, Shirish Karande, Vineet Gandhi

TL;DR
This paper improves Non-Audible Murmur (NAM)-to-speech conversion by learning phoneme alignments directly from NAMs and by incorporating the lip modality through a diffusion-based method. It also introduces the MultiNAM dataset for benchmarking.
Contribution
It presents methods for NAM-to-speech conversion, including learning phoneme alignments directly from NAMs, integrating the lip modality, and a diffusion-based approach, along with the new MultiNAM dataset.
Findings
Enhanced speech intelligibility and speaker generalization.
Effective lip-to-speech synthesis using diffusion models.
Benchmark results demonstrating improvements over existing methods.
Abstract
Current Non-Audible Murmur (NAM)-to-speech techniques rely on voice cloning to simulate ground-truth speech from paired whispers. However, the simulated speech often lacks intelligibility and fails to generalize well across different speakers. To address this issue, we focus on learning phoneme-level alignments from paired whispers and text and employ a Text-to-Speech (TTS) system to simulate the ground-truth. To reduce dependence on whispers, we learn phoneme alignments directly from NAMs, though the quality is constrained by the available training data. To further mitigate reliance on NAM/whisper data for ground-truth simulation, we propose incorporating the lip modality to infer speech and introduce a novel diffusion-based method that leverages recent advancements in lip-to-speech technology. Additionally, we release the MultiNAM dataset with over 7.96 hours of paired NAM, whisper,…
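To illustrate what a phoneme-level alignment provides to a downstream TTS system, the following is a minimal sketch, not the authors' implementation. It assumes the alignment is available as one phoneme label per acoustic frame and collapses it into per-phoneme durations measured in frames. The function name `frames_to_durations` and the sample labels are hypothetical.

```python
from itertools import groupby


def frames_to_durations(frame_phonemes):
    """Collapse per-frame phoneme labels into (phoneme, n_frames) segments.

    Assumption: the aligner emits exactly one phoneme label per frame, so a
    run of identical consecutive labels corresponds to one phoneme segment.
    """
    segments = []
    for phoneme, run in groupby(frame_phonemes):
        n_frames = sum(1 for _ in run)  # length of this run of identical labels
        segments.append((phoneme, n_frames))
    return segments


# Hypothetical frame-level alignment with leading and trailing silence.
frames = ["sil", "sil", "HH", "HH", "HH", "AY", "AY", "AY", "AY", "sil"]
durations = frames_to_durations(frames)
print(durations)
```

A duration-conditioned TTS model could consume such (phoneme, duration) pairs to produce speech that is time-aligned with the input recording. How the paper actually represents and uses its alignments is not specified in the abstract above.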
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
Methods: Diffusion models · Text-to-Speech (TTS) · Phoneme alignment
