Unsupervised Phoneme and Word Discovery from Multiple Speakers using Double Articulation Analyzer and Neural Network with Parametric Bias
Ryo Nakashima, Ryo Ozaki, Tadahiro Taniguchi

TL;DR
This paper introduces an unsupervised method combining Bayesian analysis and neural networks to discover phonemes and words from speech data of multiple speakers, mimicking infant language acquisition.
Contribution
It presents a novel integration of the nonparametric Bayesian double articulation analyzer (NPB-DAA) with a deep sparse autoencoder with parametric bias in the hidden layer (DSAE-PBHL) for speaker-independent feature extraction.
Findings
DSAE-PBHL effectively subtracts speaker-dependent features.
The combined method outperforms existing approaches in phoneme and word discovery.
The approach works well with Japanese vowel sequences from multiple speakers.
Abstract
This paper describes a new unsupervised machine learning method for simultaneous phoneme and word discovery from multiple speakers. Human infants can acquire knowledge of phonemes and words from interactions with their mothers as well as with others around them. From a computational perspective, phoneme and word discovery from multiple speakers is more challenging than discovery from a single speaker because speech signals from different speakers exhibit different acoustic features. This paper proposes an unsupervised phoneme and word discovery method that simultaneously uses a nonparametric Bayesian double articulation analyzer (NPB-DAA) and a deep sparse autoencoder with parametric bias in the hidden layer (DSAE-PBHL). We assume that an infant can recognize and distinguish speakers based on certain other features, e.g., visual face recognition. DSAE-PBHL is aimed to be able to…
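To make the idea of feeding speaker identity to an autoencoder more concrete, here is a minimal NumPy sketch. It is not the authors' DSAE-PBHL: it is a single-hidden-layer toy without the sparsity terms, and the class name `SpeakerBiasAutoencoder`, the helper `one_hot`, the synthetic data in `demo`, and all hyperparameters are hypothetical choices made for illustration. It only assumes, as the abstract does, that the speaker of each frame is known from another source. The decoder receives the hidden code together with a one-hot speaker code, so the hidden code itself has less need to carry speaker-specific offsets.

```python
import numpy as np


def one_hot(speaker_ids, num_speakers):
    """Encode integer speaker IDs as one-hot rows (the assumed side input)."""
    out = np.zeros((len(speaker_ids), num_speakers))
    out[np.arange(len(speaker_ids)), speaker_ids] = 1.0
    return out


class SpeakerBiasAutoencoder:
    """Hypothetical toy model: tanh encoder, linear decoder that also sees the speaker code."""

    def __init__(self, feat_dim, num_speakers, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0.0, 0.1, (feat_dim, hidden_dim))
        self.b_enc = np.zeros(hidden_dim)
        self.W_dec = rng.normal(0.0, 0.1, (hidden_dim + num_speakers, feat_dim))
        self.b_dec = np.zeros(feat_dim)

    def encode(self, x):
        # The hidden code is computed from acoustic features only.
        return np.tanh(x @ self.W_enc + self.b_enc)

    def train_step(self, x, speaker_code, lr=0.03):
        """One full-batch gradient step on the mean squared reconstruction error."""
        z = self.encode(x)
        h = np.concatenate([z, speaker_code], axis=1)  # decoder input
        err = h @ self.W_dec + self.b_dec - x
        loss = float(np.mean(np.sum(err ** 2, axis=1)))
        g_out = 2.0 * err / len(x)
        # Backpropagate through the decoder into the hidden code only;
        # the speaker code is a fixed input, not a learned quantity here.
        g_z = (g_out @ self.W_dec.T)[:, : z.shape[1]]
        g_pre = g_z * (1.0 - z ** 2)  # derivative of tanh
        self.W_dec -= lr * (h.T @ g_out)
        self.b_dec -= lr * g_out.sum(axis=0)
        self.W_enc -= lr * (x.T @ g_pre)
        self.b_enc -= lr * g_pre.sum(axis=0)
        return loss


def demo(steps=300):
    """Train on synthetic features: shared content plus a per-speaker offset."""
    rng = np.random.default_rng(1)
    speaker_ids = np.arange(40) % 2
    content = rng.normal(size=(40, 2))
    mixing = 0.5 * rng.normal(size=(2, 4))
    offsets = np.array([[1.0] * 4, [-1.0] * 4])  # made-up speaker effect
    x = content @ mixing + offsets[speaker_ids]
    model = SpeakerBiasAutoencoder(feat_dim=4, num_speakers=2, hidden_dim=3)
    code = one_hot(speaker_ids, 2)
    losses = [model.train_step(x, code) for _ in range(steps)]
    return losses[0], losses[-1], model.encode(x).shape
```

Calling `demo()` returns the first and last training loss and the shape of the hidden codes. In a pipeline like the one the abstract describes, codes of this kind would then be passed to the word and phoneme discovery stage (NPB-DAA), which this sketch does not attempt to reproduce.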
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
