Harmonic-aligned Frame Mask Based on Non-stationary Gabor Transform with Application to Content-dependent Speaker Comparison
Feng Huang, Peter Balazs

TL;DR
This paper introduces a harmonic-aligned frame mask based on non-stationary Gabor transform for speech signals, enhancing speaker comparison by capturing harmonic features with improved alignment.
Contribution
It extends frame mask techniques to speech signals using NSGT with pitch-dependent resolution, enabling better harmonic alignment for speaker identification.
Findings
Frame masks effectively differentiate speaker identities.
Deep neural networks validate the mask's ability to represent speaker features.
The approach shows potential for speaker identification with limited data.
Abstract
We propose a harmonic-aligned frame mask for speech signals using the non-stationary Gabor transform (NSGT). A frame mask operates on the transform coefficients of a signal and thereby converts the signal into a counterpart signal, so it depicts the difference between the two signals. In preceding studies, frame masks based on the regular Gabor transform were applied to single-note instrumental sound analysis. This study extends the frame mask approach to speech signals. For voiced speech, the fundamental frequency usually changes continuously over time. We employ NSGT with pitch-dependent, and therefore time-varying, frequency resolution to attain harmonic alignment in the transform domain and hence yield harmonic-aligned frame masks for speech signals. We propose to apply the harmonic-aligned frame mask to content-dependent speaker comparison. Frame masks, computed from voiced signals of a…
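The frame mask idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it uses a plain (stationary) short-time Fourier transform instead of NSGT, and a pointwise regularized least-squares mask chosen here purely for illustration. The function names (`stft`, `estimate_mask`, `apply_mask`), the regularization weight `reg`, and the toy signals are all assumptions. The two toy signals share a pitch that is deliberately placed on a transform bin, which stands in for the harmonic alignment that the paper obtains through pitch-dependent resolution.

```python
import cmath
import math


def stft(signal, win_len=64, hop=16):
    """Naive Hann-windowed STFT: list of frames, each a list of complex bins."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_len) for n in range(win_len)]
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(win_len)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win_len) for n in range(win_len))
            for k in range(win_len // 2 + 1)
        ])
    return frames


def estimate_mask(coef_a, coef_b, reg=1e-3):
    """Pointwise mask m minimizing |m*a - b|^2 + reg*|m - 1|^2 per coefficient.

    The penalty pulls the mask toward 1 wherever the source coefficient is weak.
    (Illustrative choice, not taken from the paper.)
    """
    return [
        [(a.conjugate() * b + reg) / (abs(a) ** 2 + reg) for a, b in zip(fa, fb)]
        for fa, fb in zip(coef_a, coef_b)
    ]


def apply_mask(coef, mask):
    """Multiply transform coefficients by the mask, element by element."""
    return [[m * c for m, c in zip(fm, fc)] for fm, fc in zip(mask, coef)]


# Toy "voiced" signals: same pitch (bin 4 of a 64-point DFT), but a
# different strength of the second harmonic (bin 8).
N, WIN = 256, 64
f0 = 4 / WIN  # cycles per sample, aligned with bin 4
sig_a = [math.sin(2 * math.pi * f0 * n) + 0.5 * math.sin(2 * math.pi * 2 * f0 * n)
         for n in range(N)]
sig_b = [math.sin(2 * math.pi * f0 * n) + 1.0 * math.sin(2 * math.pi * 2 * f0 * n)
         for n in range(N)]

coef_a, coef_b = stft(sig_a, WIN), stft(sig_b, WIN)
mask = estimate_mask(coef_a, coef_b)
masked = apply_mask(coef_a, mask)  # approximates coef_b
```

In this toy case the mask stays close to 1 at the shared fundamental and close to 2 at the second harmonic, so it encodes only how the harmonic structure of the two signals differs. When the pitch drifts over time, harmonics stop lining up with fixed bins, which is the situation the pitch-dependent NSGT resolution is meant to handle.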
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Image and Signal Denoising Methods
