Real-time speech enhancement in noise for throat microphone using neural audio codec as foundation model
Julien Hauret, Thomas Joubaud, Éric Bavu

TL;DR
This paper demonstrates real-time speech enhancement for throat microphone recordings in noisy environments using a fine-tuned neural audio codec, improving audio quality and robustness in a practical demo setting.
Contribution
It introduces a novel pipeline combining throat microphone recordings with a fine-tuned neural audio codec for real-time speech enhancement in noisy conditions.
Findings
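Outperforms the state-of-the-art models it is compared against
Real-time inference, with processing latency monitored in the demo interface
Throat microphone recordings naturally attenuate external noise, at the cost of reduced audio bandwidth that the enhancement model must restore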
Abstract
We present a real-time speech enhancement demo using speech captured with a throat microphone. This demo aims to showcase the complete pipeline, from recording to deep learning-based post-processing, for speech captured in noisy environments with a body-conducted microphone. The throat microphone records skin vibrations, which naturally attenuate external noise, but this robustness comes at the cost of reduced audio bandwidth. To address this challenge, we fine-tune Kyutai's Mimi, a neural audio codec supporting real-time inference, on Vibravox, a dataset containing paired air-conducted and throat microphone recordings. We compare this enhancement strategy against state-of-the-art models and demonstrate its superior performance. The inference runs in an interactive interface that allows users to toggle enhancement, visualize spectrograms, and monitor processing latency.
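The frame-by-frame processing and latency monitoring described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `enhance_frame` is a hypothetical stand-in that returns its input unchanged where the fine-tuned codec would run, and the 24 kHz sample rate and 1,920-sample (80 ms) frame size are assumptions chosen for the example.

```python
import math
import time

SAMPLE_RATE = 24_000   # assumption: codec sample rate in Hz
FRAME_SIZE = 1_920     # assumption: 80 ms frames at 24 kHz


def enhance_frame(frame):
    """Hypothetical stand-in for the fine-tuned codec; returns the frame unchanged."""
    return list(frame)


def stream_enhance(signal, frame_size=FRAME_SIZE, sample_rate=SAMPLE_RATE):
    """Process a signal frame by frame and record per-frame processing time."""
    output, latencies = [], []
    for start in range(0, len(signal), frame_size):
        frame = signal[start:start + frame_size]
        n_valid = len(frame)
        # Zero-pad the last frame so the model always sees a full frame.
        frame = frame + [0.0] * (frame_size - n_valid)
        t0 = time.perf_counter()
        enhanced = enhance_frame(frame)
        latencies.append(time.perf_counter() - t0)
        # Drop the padding again so output and input lengths match.
        output.extend(enhanced[:n_valid])
    frame_duration = frame_size / sample_rate
    # Worst-case real-time factor: below 1 means processing keeps up with input.
    rtf = max(latencies) / frame_duration if latencies else 0.0
    return output, latencies, rtf


# One second of a 200 Hz tone as a toy input.
signal = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
output, latencies, rtf = stream_enhance(signal)
```

In a live demo the frames would come from an audio input callback rather than a precomputed list, and the per-frame timings in `latencies` are the kind of value a latency display could show.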
