Throat and acoustic paired speech dataset for deep learning-based speech enhancement
Yunsik Kim, Yonghun Song, and Yoonyoung Chung

TL;DR
This paper introduces the TAPS dataset, a paired throat and acoustic speech dataset for deep learning, along with an alignment method, to advance speech enhancement research in noisy environments.
Contribution
The paper presents the first standardized dataset of paired throat and acoustic microphone speech, together with an alignment method, enabling improved deep learning-based enhancement of throat microphone recordings.
Findings
Mapping-based models outperform the other baseline models tested in restoring speech quality
TAPS dataset effectively supports deep learning research in noisy environments
Optimal alignment improves signal matching between throat and acoustic recordings
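This summary does not describe the paper's alignment method in detail. As a rough illustration of what aligning a throat recording to its acoustic counterpart involves, the sketch below estimates a single global time lag by cross-correlation and trims both signals accordingly. This is a generic technique swapped in for illustration, not the authors' method; the function names `estimate_lag` and `align` and the `max_lag` parameter are hypothetical.

```python
import numpy as np


def estimate_lag(acoustic, throat, max_lag):
    """Estimate how many samples `throat` lags behind `acoustic`.

    Generic cross-correlation sketch (an assumption, not the paper's method).
    A positive result means the throat signal is delayed.
    """
    # Remove DC offsets so the correlation peak reflects waveform shape.
    acoustic = acoustic - acoustic.mean()
    throat = throat - throat.mean()

    # corr[k] = sum_n throat[n + k] * acoustic[n], over all overlapping lags k.
    corr = np.correlate(throat, acoustic, mode="full")
    lags = np.arange(-(len(acoustic) - 1), len(throat))

    # Restrict the search to physically plausible lags.
    keep = np.abs(lags) <= max_lag
    return int(lags[keep][np.argmax(corr[keep])])


def align(acoustic, throat, lag):
    """Drop leading samples so both signals start together, then trim to equal length."""
    if lag > 0:
        throat = throat[lag:]
    elif lag < 0:
        acoustic = acoustic[-lag:]
    n = min(len(acoustic), len(throat))
    return acoustic[:n], throat[:n]
```

A call such as `align(acoustic, throat, estimate_lag(acoustic, throat, max_lag=800))` would return two equal-length arrays. A single global lag is the simplest possible choice; it does not account for the spectral differences between the two microphones or for any drift within an utterance.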
Abstract
In high-noise environments such as factories, subways, and busy streets, capturing clear speech is challenging. Throat microphones can offer a solution because of their inherent noise-suppression capabilities; however, the passage of sound waves through skin and tissue attenuates high-frequency information, reducing speech clarity. Recent deep learning approaches have shown promise in enhancing throat microphone recordings, but further progress is constrained by the lack of a standard dataset. Here, we introduce the Throat and Acoustic Paired Speech (TAPS) dataset, a collection of paired utterances recorded from 60 native Korean speakers using throat and acoustic microphones. Furthermore, an optimal alignment approach was developed and applied to address the inherent signal mismatch between the two microphones. We tested three baseline deep learning models on the TAPS dataset and found…
