Improving Unsupervised Subword Modeling via Disentangled Speech Representation Learning and Transformation
Siyuan Feng, Tan Lee

TL;DR
This paper introduces a method to improve unsupervised subword modeling by disentangling linguistic content and speaker information using FHVAE, leading to more accurate frame labels and better speech representations in zero-resource scenarios.
Contribution
The study proposes integrating FHVAE with DNN-BNF to enhance frame label quality and speaker invariance in unsupervised speech modeling, demonstrating significant error rate reductions.
Findings
Achieved 2.4% and 0.6% absolute ABX error rate reductions in across- and within-speaker conditions on ZeroSpeech 2017, relative to the baseline DNN-BNF system.
Outperformed vocal tract length normalization in improving subword modeling.
Disentangling speaker and linguistic features improves unsupervised speech representations.
Table 1
| Language | Training duration | Training #speakers-R* | Training #speakers-L* | Test duration |
| English | hrs | | | hrs |
| French | hrs | | | hrs |
| Mandarin | hrs | | | hrs |
* “speakers-R/-L” denotes speakers with rich/limited speech data.
Table 2
| ID | DNN-BNF input | DPGMM input | Across-speaker (English, French, Mandarin at 1s/10s/120s; Avg.) | Within-speaker (English, French, Mandarin at 1s/10s/120s; Avg.) |
| Baseline | | | | |
| CA-Sup [9] | | | | |
| MFCC [10] | | | | |
| MFCC+VTLN [10] | | | | |
| ① | | Orig. | | |
| ② | | Orig. | | |
| ③ | | $\hat{x}$-s0107 | | |
| ④ | | $\hat{x}$-s0107 | | |
| ⑤ | | $\hat{x}$-s4018 | | |
| ⑥ | | $\hat{x}$-s4018 | | |
Abstract
This study tackles unsupervised subword modeling in the zero-resource scenario, learning frame-level speech representation that is phonetically discriminative and speaker-invariant, using only untranscribed speech for target languages. Frame label acquisition is an essential step in solving this problem. High-quality frame labels should be highly consistent with gold-standard transcriptions and robust to speaker variation. We propose to improve frame label acquisition in our previously adopted deep neural network-bottleneck feature (DNN-BNF) architecture by applying the factorized hierarchical variational autoencoder (FHVAE). FHVAEs learn to disentangle linguistic content and speaker identity information encoded in speech. By discarding or unifying speaker information, speaker-invariant features are learned and fed as inputs to DPGMM frame clustering and DNN-BNF training. Experiments conducted on ZeroSpeech 2017 show that our proposed approaches achieve 2.4% and 0.6% absolute ABX error rate reductions in across- and within-speaker conditions, compared to the baseline DNN-BNF system without applying FHVAEs. Our proposed approaches significantly outperform vocal tract length normalization in improving frame labeling and subword modeling.
Index Terms: unsupervised subword modeling, disentangled representation, speaker-invariant feature, zero resource
1 Introduction
Recent years have witnessed a huge success in applying deep learning techniques in acoustic and language modeling for automatic speech recognition (ASR). Training deep neural network (DNN) acoustic models requires large amounts of transcribed speech data. For many languages in the world, for which very little or no transcribed speech is available, conventional supervised acoustic modeling techniques cannot be directly applied.
Unsupervised acoustic modeling (UAM) aims at discovering and modeling acoustic units of an unknown language at subword or word level, assuming only untranscribed speech data are available. UAM is a challenging problem with significant practical impact in speech as well as linguistics and cognitive science communities. It has been studied in applications such as ASR for low-resource languages [1], language identification [2] and query-by-example spoken term detection [3]. This problem is also relevant to endangered language protection [4] and understanding infants’ language acquisition mechanism [5].
In recent years, the Zero Resource Speech Challenges (ZeroSpeech) 2015 [6] and 2017 [7] were organized to focus on unsupervised speech modeling. ZeroSpeech 2017 Track one, named unsupervised subword modeling, was formulated as an unsupervised feature representation learning problem, i.e., how to learn frame-level speech features that are discriminative to subword units and robust to linguistically-irrelevant variations such as speaker identity. The present study addresses this problem, which is fundamental to unsupervised speech modeling. Speech simultaneously encodes linguistically-relevant information, e.g. subword units, and linguistically-irrelevant information, e.g. speaker variation, and the two are not easily separable. In supervised acoustic modeling, gold-standard transcriptions can be relied on to ensure the robustness of the learned subword units towards linguistically-irrelevant information. In the unsupervised scenario, subword units and word patterns can only be inferred from speech features. This makes feature representation learning important in the zero-resource scenario. In the literature, representation learning has been shown to benefit downstream applications such as spoken query retrieval [8].
In our previous submission to ZeroSpeech 2017 [9], a DNN was trained with zero-resource speech data to generate bottleneck features (BNFs) as the learned feature representation. Frame labels for supervised DNN training were obtained through Dirichlet process Gaussian mixture model (DPGMM) based frame clustering. This framework is similar to [10]. By employing out-of-domain transcribed speech data for speaker adapted feature learning and DNN frame labeling, the results in [9] significantly outperform [10], in which out-of-domain data were not employed. This improvement is mainly attributed to better frame label acquisition. Ideally, the learned frame labels should fully cover the linguistically-defined phonemes. They should be highly consistent with gold-standard transcriptions and robust to speaker variation. The quality of frame labels has a significant impact on the performance of subword modeling [11]. Many prior works found that DPGMM clustering on speaker adapted features generates better labels than clustering on unadapted features [12, 11, 10]. In [10], the authors compared MFCC features with and without vocal tract length normalization (VTLN) for clustering. In [11], MFCCs were first clustered to generate an initial tokenization, with which linear transforms such as LDA, MLLT and fMLLR were estimated. The fMLLR features were then clustered again to generate the final frame labels. This work achieved the best performance in ZeroSpeech 2017. It is worth noting that DPGMM clustering is computationally expensive. Typically, clustering towards -hour speech data for iterations using CPU cores takes up to hours. This makes the system in [11] much heavier than those in [10, 9].
In the strict zero-resource scenario, out-of-domain speech and language resources are unavailable. This paper proposes to improve DPGMM frame labeling using only in-domain untranscribed speech data, and refrains from performing multiple-pass clustering. Specifically, the factorized hierarchical variational autoencoder (FHVAE) model [13] is used to disentangle linguistic content and speaker information in raw speech features in an unsupervised manner. By either discarding or unifying speaker information, a speaker-invariant representation is learned and used as the input to DPGMM clustering and DNN-BNF training. The FHVAE is an unsupervised generative model. It was originally proposed to deal with domain adaptation problems in noise robust ASR [14] and distant conversational ASR [15], and was later applied to dialect identification [16]. To the best of our knowledge, the use of FHVAEs in unsupervised subword modeling has never been studied before.
2 Speaker-invariant feature learning by FHVAE
Speaker characteristics tend to have a smaller amount of variation than linguistic content within a speech utterance, while linguistic content tends to have similar amounts of variation within and across utterances. The FHVAE model [13], which learns to factorize sequence-level and segment-level attributes of sequential data into different latent variables, is applied in this work to disentangle linguistic content and speaker characteristics.
2.1 FHVAE model
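FHVAEs formulate the generation process of sequential data by imposing sequence-dependent priors and sequence-independent priors on different sets of variables. Following the notations and terminologies in [13], let $z_1$ and $z_2$ denote the latent segment variable and the latent sequence variable, respectively. $\mu_2$ is the sequence-dependent prior, named the s-vector. $\theta$ and $\phi$ denote the parameters of the generation and inference models of FHVAEs. Let $\mathcal{D}=\{X^i\}_{i=1}^{M}$ denote a speech dataset with $M$ sequences. Each $X^i$ contains $N^i$ speech segments $\{x^{(i,n)}\}_{n=1}^{N^i}$, where $x^{(i,n)}$ is composed of fixed-length consecutive frames. The FHVAE model generates a sequence $X$ from a random process as follows: (1) $\mu_2$ is drawn from a prior distribution $p_\theta(\mu_2)=\mathcal{N}(0,\sigma^2_{\mu_2}I)$; (2) $z_1^n$ and $z_2^n$ are drawn from $p_\theta(z_1^n)=\mathcal{N}(0,\sigma^2_{z_1}I)$ and $p_\theta(z_2^n|\mu_2)=\mathcal{N}(\mu_2,\sigma^2_{z_2}I)$ respectively; (3) speech segment $x^n$ is drawn from $p_\theta(x^n|z_1^n,z_2^n)=\mathcal{N}(f_{\mu_x}(z_1^n,z_2^n),\mathrm{diag}(f_{\sigma^2_x}(z_1^n,z_2^n)))$. Here $\mathcal{N}$ denotes the normal distribution, and $f_{\mu_x}(\cdot,\cdot)$ and $f_{\sigma^2_x}(\cdot,\cdot)$ are parameterized by DNNs. The joint probability for $X$ is formulated as,
$$p_\theta(X,Z_1,Z_2,\mu_2)=p_\theta(\mu_2)\prod_{n=1}^{N}p_\theta(x^n|z_1^n,z_2^n)\,p_\theta(z_1^n)\,p_\theta(z_2^n|\mu_2).$$
Similar to VAE models, FHVAEs introduce an inference model $q_\phi$ to approximate the intractable true posterior as,
$$q_\phi(Z_1,Z_2,\mu_2|X)=q_\phi(\mu_2)\prod_{n=1}^{N}q_\phi(z_1^n|x^n,z_2^n)\,q_\phi(z_2^n|x^n).$$
Here $q_\phi(\mu_2)$, $q_\phi(z_1^n|x^n,z_2^n)$ and $q_\phi(z_2^n|x^n)$ are all diagonal Gaussian distributions. The mean and variance values of $q_\phi(z_1^n|x^n,z_2^n)$ and $q_\phi(z_2^n|x^n)$ are parameterized by two DNNs. For $q_\phi(\mu_2)$, during FHVAE training, a trainable lookup table containing the posterior mean of $\mu_2$ for each sequence is updated. During testing, maximum a posteriori (MAP) estimation is used to infer $\mu_2$ for unseen test sequences. Details of $\mu_2$ estimation for test sequences are described in [13].
FHVAEs optimize the discriminative segmental variational lower bound defined as,
$$\mathcal{L}(\theta,\phi;x^{(i,n)})=\mathbb{E}_{q_\phi(z_1,z_2|x^{(i,n)})}\big[\log p_\theta(x^{(i,n)}|z_1,z_2)\big]-\mathbb{E}_{q_\phi(z_2|x^{(i,n)})}\big[\mathrm{KL}\big(q_\phi(z_1|x^{(i,n)},z_2)\,\|\,p_\theta(z_1)\big)\big]-\mathrm{KL}\big(q_\phi(z_2|x^{(i,n)})\,\|\,p_\theta(z_2|\tilde{\mu}_2^i)\big)+\frac{1}{N^i}\log p_\theta(\tilde{\mu}_2^i)+\alpha\log p(i|z_2^{(i,n)}),$$
where $i$ is the sequence index, $\tilde{\mu}_2^i$ denotes the posterior mean of $\mu_2$ for the $i$-th sequence, and $\alpha$ denotes the discriminative weight. The discriminative objective $\log p(i|z_2^{(i,n)})$ is defined as $\log p_\theta(z_2^{(i,n)}|\tilde{\mu}_2^i)-\log\sum_{j=1}^{M}p_\theta(z_2^{(i,n)}|\tilde{\mu}_2^j)$.
After FHVAE training, $z_2$ encodes factors that are relatively consistent within a sequence. The discriminative objective ensures that $z_2$ captures sequence-dependent information. $z_1$ encodes residual factors that are sequence-independent.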
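The discriminative objective can be read as a log-softmax over sequences: the log-likelihood of a latent sequence variable under its own sequence's s-vector, normalized by the likelihoods under all s-vectors. The sketch below illustrates that reading following the formulation in [13]. It is a minimal illustration only: the function names, the isotropic variance `var`, and the toy s-vectors are assumptions, not values or code from the paper.

```python
import math


def log_gaussian_isotropic(z, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I) evaluated at z."""
    dim = len(z)
    sq_dist = sum((a - b) ** 2 for a, b in zip(z, mean))
    return -0.5 * (dim * math.log(2.0 * math.pi * var) + sq_dist / var)


def discriminative_objective(z2, s_vectors, seq_index, var=1.0):
    """log p(i | z2) = log p(z2 | mu2_i) - log sum_j p(z2 | mu2_j).

    `var` stands in for the prior variance of z2 (a hypothetical value here).
    """
    log_liks = [log_gaussian_isotropic(z2, mu, var) for mu in s_vectors]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(log_liks)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in log_liks))
    return log_liks[seq_index] - log_norm


# Toy s-vectors for three sequences (hypothetical values).
s_vectors = [[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]]
z2 = [2.8, 3.1]  # a latent sequence variable lying close to the second s-vector
scores = [discriminative_objective(z2, s_vectors, i) for i in range(3)]
```

Maximizing this term pushes each latent sequence variable towards the s-vector of its own sequence and away from the others, which is why the latent sequence variable ends up carrying sequence-dependent (here, speaker-dependent) information.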
2.2 Extracting speaker-invariant features by FHVAE
In order to apply the FHVAE model to speaker-invariant feature learning, training utterances of the same speaker are concatenated into a single sequence. By this means, $z_2$ is expected to encode speaker identity information and carry little phonetic information, while $z_1$ is expected to encode residual information, i.e. linguistic content, and carry little speaker information. This work considers two methods of obtaining speaker-invariant feature representations from a trained FHVAE. The first method simply treats latent segment variables $z_1$ as the desired feature representation.
In the second method, the FHVAE model reconstructs speech features of all utterances based on a unified s-vector. The reconstructed features are the desired representation. Specifically, a representative speaker with s-vector $\mu_2^*$ is chosen from the dataset. Next, for each speech segment $x^{(i,n)}$ of an arbitrary speaker $i$, its corresponding latent sequence variable $z_2^{(i,n)}$ is transformed to $\hat{z}_2^{(i,n)}=z_2^{(i,n)}-\mu_2^i+\mu_2^*$, where $\mu_2^i$ denotes the s-vector of speaker $i$. Finally, the FHVAE decoder reconstructs speech segment $\hat{x}^{(i,n)}$ conditioned on $z_1^{(i,n)}$ and $\hat{z}_2^{(i,n)}$ using $p_\theta(x|z_1,z_2)$. This method is named s-vector unification in this work. Compared to the original features, the reconstructed features are expected to keep the linguistic content unchanged and capture speaker characteristics corresponding to the representative speaker. In other words, speech synthesized from $\hat{x}$ would tend to sound as if it were all spoken by the representative speaker.
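One natural reading of s-vector unification is a shift in the latent space: the latent sequence variable keeps its offset from its own speaker's s-vector, and that offset is re-centred on the representative speaker's s-vector. The sketch below assumes this shift form. The function name, speaker labels, and numbers are hypothetical, and the decoder step that turns the shifted variable into reconstructed MFCCs is not shown.

```python
def unify_s_vector(z2, mu2_speaker, mu2_target):
    """Shift a latent sequence variable from its own speaker's s-vector
    to the representative speaker's s-vector, element by element."""
    return [z - src + tgt for z, src, tgt in zip(z2, mu2_speaker, mu2_target)]


# Hypothetical 2-dimensional s-vectors for two speakers.
mu2 = {"spkA": [1.0, -2.0], "spkB": [-0.5, 0.5]}
mu2_star = mu2["spkA"]  # spkA plays the role of the representative speaker

z2_b = [-0.3, 0.9]  # latent sequence variable of one segment from spkB
z2_hat = unify_s_vector(z2_b, mu2["spkB"], mu2_star)
```

Under this reading, segments that already belong to the representative speaker are left unchanged, and every other speaker's segments are moved into the representative speaker's region of the latent space before decoding.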
3 Unsupervised subword modeling with speaker-invariant features
3.1 DNN-BNF architecture
A DNN-BNF architecture [10, 9] is adopted to perform phonetically discriminative training on untranscribed speech data and generate BNFs for subword modeling. In this architecture, given untranscribed speech data, the Dirichlet process Gaussian mixture model (DPGMM) [17] algorithm is applied to cluster frame-level MFCC features for each target language individually. After clustering, each frame is assigned a cluster label. These frame labels are regarded as pseudo phoneme alignments to support supervised DNN training. A multilingual DNN with a linear bottleneck layer is trained with the frame alignments and MFCC features of all the target languages simultaneously, using multi-task learning [18]. After training, multilingual BNFs are extracted as the subword discriminative representation.
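The labeling step can be illustrated separately from the clustering itself. The sketch below is not DPGMM inference, which also infers the number of components; it only shows how, given an already fitted mixture of diagonal Gaussians, each frame could be hard-assigned to its most probable component to obtain a frame label. All names and the toy mixture parameters are assumptions for illustration.

```python
import math


def log_diag_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (a - m) ** 2 / v
        for a, m, v in zip(x, mean, var)
    )


def frame_labels(frames, weights, means, variances):
    """Assign each frame the index of the component with the highest
    weighted log-likelihood (a hard cluster label)."""
    labels = []
    for x in frames:
        scores = [
            math.log(w) + log_diag_gaussian(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Toy two-component mixture over 2-dimensional features (hypothetical values).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [3.9, 4.2], [0.5, 0.4]]
labels = frame_labels(frames, weights, means, variances)
```

The resulting integer labels play the role of the pseudo phoneme alignments that serve as DNN training targets.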
3.2 DNN-BNF training with speaker-invariant features
Speaker-invariant features learned by FHVAEs are applied to the DNN-BNF architecture in two ways. As can be seen in Figure 1, during DPGMM-based frame clustering, the input features to DPGMM are reconstructed MFCCs generated by the FHVAE decoder network using the s-vector unification method described in Section 2.2, instead of the original MFCCs. Compared to the original MFCCs, FHVAE reconstructed MFCCs carry speaker information that is more consistent across utterances spoken by different speakers. With the reconstructed features as inputs, DPGMM clustering is expected to generate phoneme-like labels that are of better quality and less affected by speaker variation.
During DNN-BNF model training, FHVAE-based speaker-invariant features are fed as inputs to the DNN. As seen in Figure 1, two feature types are considered as DNN inputs in this study, i.e. reconstructed MFCCs $\hat{x}$ with s-vector unification and latent segment variables $z_1$. The effectiveness of these two types of features is compared in this study.
4 Experimental setup
4.1 Dataset and evaluation metric
Experiments are carried out with ZeroSpeech 2017 Track one [7]. Speaker identity information is released only for train sets. Detailed information is listed in Table 1.
The evaluation metric is ABX subword discriminability. The ABX task is to decide whether $X$ belongs to $x$ or $y$ if $A$ belongs to $x$ and $B$ belongs to $y$, where $A$, $B$ and $X$ are three speech segments, and $x$ and $y$ are two triphone categories that differ only in the central sound (e.g., “beg”-“bag”). Each pair of $A$ and $B$ is generated by the same speaker. ABX error rates for within-speaker and across-speaker conditions are evaluated separately, depending on whether $X$ and the pair $A$, $B$ belong to the same speaker.
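The ABX decision can be sketched as a comparison of two sequence distances. The example below is a simplified illustration and not the official evaluation toolkit: it assumes a cosine frame distance and an accumulated dynamic time warping (DTW) cost without path-length normalization, and it counts ties as half an error. All function names and toy feature values are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(s, t, frame_dist=cosine_distance):
    """Accumulated DTW cost between two sequences of frames."""
    n, m = len(s), len(t)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(s[i - 1], t[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def abx_error_rate(triples):
    """Each triple is (A, B, X) where X shares its category with A.
    An error occurs when X is closer to B than to A."""
    errors = 0.0
    for a, b, x in triples:
        d_ax, d_bx = dtw_distance(a, x), dtw_distance(b, x)
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
    return errors / len(triples)


a = [[1.0, 0.0], [1.0, 0.1]]
b = [[0.0, 1.0], [0.1, 1.0]]
x_good = [[1.0, 0.05], [0.9, 0.1], [1.0, 0.0]]  # resembles A
x_bad = [[0.05, 1.0]]                           # resembles B despite its label
rate = abx_error_rate([(a, b, x_good), (a, b, x_bad)])
```

In the within-speaker condition all three segments would come from one speaker, while in the across-speaker condition $X$ would come from a different speaker than $A$ and $B$. The scoring logic is the same in both cases.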
4.2 FHVAE setup and parameter tuning
FHVAE model parameters are determined by reference to [14]. The encoder and decoder networks of FHVAE are both -layer LSTMs with neurons per layer. The dimensions of and are . Training data for the three target languages are merged to train the FHVAE. Input features are fixed-length speech segments randomly chosen from utterances. The determination of segment length is discussed in the next paragraph. Each frame is represented by a -dimensional MFCC with cepstral mean normalization at speaker level. During the inference of reconstructed feature representation, input segments are shifted by frame. To match the length of extracted features with original MFCCs, the first and last frame are padded. Adam [19] with and is used to train the FHVAE. A subset of training data is randomly selected for cross-validation. The training process is terminated if the lower bound on the cross-validation set does not improve for epochs. Open-source tools [13] are used to train FHVAEs.
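Cepstral mean normalization at speaker level, as mentioned above, subtracts from every frame the mean feature vector computed over all frames of the same speaker. A minimal sketch follows. The data layout (a list of `(speaker_id, frames)` pairs) and the toy values are assumptions made for illustration.

```python
def speaker_cmn(utterances):
    """Apply cepstral mean normalization per speaker.

    `utterances` is a list of (speaker_id, frames) pairs, where `frames`
    is a list of equal-length feature vectors. Returns the same structure
    with each speaker's mean vector subtracted from all of that speaker's frames.
    """
    sums, counts = {}, {}
    for speaker, frames in utterances:
        for frame in frames:
            if speaker not in sums:
                sums[speaker] = [0.0] * len(frame)
                counts[speaker] = 0
            sums[speaker] = [s + v for s, v in zip(sums[speaker], frame)]
            counts[speaker] += 1
    means = {spk: [s / counts[spk] for s in sums[spk]] for spk in sums}
    return [
        (spk, [[v - m for v, m in zip(frame, means[spk])] for frame in frames])
        for spk, frames in utterances
    ]


# Two utterances from speaker "s1" and one from "s2" (toy 2-dimensional features).
utterances = [
    ("s1", [[1.0, 2.0], [3.0, 4.0]]),
    ("s1", [[5.0, 6.0]]),
    ("s2", [[10.0, 0.0], [12.0, 2.0]]),
]
normalized = speaker_cmn(utterances)
```

Pooling the statistics over all utterances of a speaker, rather than per utterance, removes a constant speaker- and channel-dependent offset while leaving within-speaker variation between utterances intact.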
In our preliminary experiments, the ABX performance of $z_1$ was found to be sensitive to the input segment length. A plausible explanation is that too long a segment reduces the capability of $z_1$ to model linguistic content at subword level, while too short a segment restricts the FHVAE from capturing sufficient temporal dependencies, which are essential in modeling speech. ABX error rates on $z_1$ with different segment lengths are shown in Figure 2. The optimal value of is . For the remaining experiments in this work, is fixed to .
4.3 Selecting representative speaker for reconstructed feature extraction
The extraction of reconstructed MFCCs using s-vector unification assumes a pre-defined representative speaker. In order to validate the generalization ability of our proposed s-vector unification method and evaluate its sensitivity to the gender of the representative speaker, English speakers {s0107, s3020, s4018, s0019, s1724, s2544}, French speakers {M02R, M03R, F01R, F02R} and Mandarin speakers {A08, C04} are randomly chosen from the ‘speaker-R’ sets of ZeroSpeech 2017 training data. The first half of the speakers in each language set are male and the second half are female. During the extraction of $\hat{x}$, the s-vectors of all three target languages’ utterances are modified to the same $\mu_2^*$ corresponding to one of the speakers mentioned above. The performance of the 12 groups of $\hat{x}$ is evaluated by the ABX discriminability task.
4.4 DNN-BNF setup
For the baseline system without using FHVAE-based speaker-invariant features, input features to DPGMM are -dimensional MFCCs+Δ+ΔΔ. The numbers of clustering iterations for English, French and Mandarin sets are and . After clustering, each frame is assigned a label. A DNN-BNF is trained with all three languages’ cepstral mean normalized MFCCs+Δ+ΔΔ and frame labels using multi-task learning with equal task weights. The dimensions of hidden layers are . After training, -dimensional BNFs for test sets are extracted and evaluated by the ABX task. DPGMM is implemented using tools developed by [17]. DNN-BNF training is implemented using the Kaldi nnet1 recipe [20].
For the systems employing FHVAE-based speaker-invariant features, input features to DPGMM are reconstructed MFCCs $\hat{x}$ with s-vector unification, further appended with Δ+ΔΔ. The representative speaker is selected from the speakers mentioned in Section 4.3. The numbers of clustering iterations for the three languages are and . DNN-BNFs are trained with either reconstructed MFCCs or latent segment variables $z_1$. The extraction of is slightly different from . During the inference of for training sets, s-vector unification is not applied; during the inference for test sets, s-vector unification is applied within every test subset with a subset-specific $\mu_2^*$. The reason is that DNN-BNFs trained with were found to outperform those trained with . The DNN-BNF mentioned here has the same structure and loss function as that in the baseline system.
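Multi-task learning with equal task weights can be sketched as a weighted sum of per-language cross-entropy losses, where each frame contributes only to the output layer of its own language. The example below is an illustrative assumption about how such a loss could be computed, not the Kaldi nnet1 implementation. The function names, the batch layout, and the toy logits are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax probability of the target label."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[label]


def multitask_loss(batch, task_weights):
    """Average weighted cross-entropy over a mixed-language batch.

    `batch` is a list of (language, logits, frame_label) triples, where
    `logits` come from that language's own output layer on top of shared
    hidden layers (the shared layers are not modeled in this sketch).
    """
    total = 0.0
    for language, logits, label in batch:
        total += task_weights[language] * cross_entropy(logits, label)
    return total / len(batch)


# Equal task weights for the three target languages.
task_weights = {"English": 1.0, "French": 1.0, "Mandarin": 1.0}
batch = [
    ("English", [0.0, 0.0, 0.0, 0.0], 2),  # 4 hypothetical cluster labels
    ("French", [0.0, 0.0], 1),             # 2 hypothetical cluster labels
]
loss = multitask_loss(batch, task_weights)
```

Because the hidden layers, including the bottleneck, are shared across tasks, gradients from all languages shape the same bottleneck representation, while the language-specific output layers accommodate the different label inventories produced by per-language clustering.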
5 Results and analyses
5.1 Effectiveness of reconstructed MFCCs
ABX error rates on the 12 groups of reconstructed MFCCs $\hat{x}$ using s-vector unification are shown in Figure 3. Each group is presented as a bar in a bar graph. The reference line denotes the ABX error rate on latent segment variables $z_1$. It can be observed that $\hat{x}$ outperforms $z_1$ in the across-speaker condition regardless of which of the 12 speakers is chosen as the representative. In the within-speaker condition, $\hat{x}$ performs slightly better than $z_1$ in most of the male cases, and is worse in all female cases. Further studies are needed to explain why male speakers are more suitable than female speakers for s-vector unification.
5.2 DNN-BNFs trained with reconstructed MFCCs
Experimental results of the baseline DNN-BNF system and systems adopting FHVAE-based speaker-invariant features are summarized in Table 2. The second and third columns of IDs ①-⑥ denote inputs to DNN-BNF training and DPGMM clustering, respectively. ‘Orig.’ denotes original MFCCs without reconstruction. ‘$\hat{x}$-s0107/-s4018’ denotes reconstructed MFCCs with representative speaker s0107 or s4018. Here, $\hat{x}$-s4018 is used to represent the ideal case, as s4018 performs the best among the 12 speakers in the across-speaker condition (see Figure 3). $\hat{x}$-s0107 represents the general case, as s0107 performs moderately among the 6 male speakers. The system exploiting a Cantonese ASR for fMLLR estimation [9] is denoted as ‘CA-Sup’. From this table, several observations can be made:
(1) The comparison between the baseline and ① & ② shows that, without improving frame labels, the DNN-BNF model trained with $\hat{x}$ or $z_1$ outperforms that trained with raw MFCCs, especially in the across-speaker condition.
(2) The reconstructed MFCC features significantly outperform original MFCCs in DPGMM frame labeling. In the ideal case where the representative speaker ‘s4018’ is selected, by comparing ⑤ and ①, frame labeling based on $\hat{x}$ contributes to and relative ABX error rate reductions in across- and within-speaker conditions, compared to that based on original MFCCs. In the general case where ‘s0107’ is selected, by comparing ③ and ①, the relative error rate reductions are and in across- and within-speaker conditions. The results demonstrate the importance of applying FHVAE-based speaker-invariant features in frame labeling.
(3) Our best system ⑤ achieves 2.4% and 0.6% absolute ( and relative) ABX error rate reductions compared to the baseline DNN-BNF system in across- and within-speaker conditions. The error rate reductions are attributed to better frame labeling and more speaker-invariant input features. As can be seen from the baseline, ① and ⑤, the improvement from frame labeling is more prominent than that from input features. Compared to system CA-Sup, in which out-of-domain transcribed data are exploited, ⑤ is slightly better in the within-speaker condition and slightly inferior in the across-speaker condition.
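The observations above quote both absolute and relative error rate reductions. The relationship between the two is simple arithmetic, sketched below with purely illustrative numbers that are not results from this paper. The function name is hypothetical.

```python
def error_reductions(baseline_error, system_error):
    """Return (absolute, relative) error rate reductions in percent.

    Absolute reduction is the difference in percentage points; relative
    reduction expresses that difference as a percentage of the baseline.
    """
    absolute = baseline_error - system_error
    relative = 100.0 * absolute / baseline_error
    return absolute, relative


# Illustrative only: a baseline at 10.0% error improved to 8.0%.
abs_red, rel_red = error_reductions(10.0, 8.0)
```

With these toy inputs the absolute reduction is 2.0 percentage points and the relative reduction is 20%, which shows why the same improvement looks larger in relative terms when the baseline error is low.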
We also compare the effectiveness of our proposed approaches with [10], in which VTLN was adopted to improve frame labeling. As seen in Table 2, in across-speaker condition, while our baseline system is inferior to their baseline (MFCC), our best system consistently outperforms their system MFCC+VTLN in all test subsets. In within-speaker condition, our proposed approaches also achieve better performance. The comparison shows that FHVAE-based speaker-invariant feature learning is more effective than VTLN in improving the quality of frame labels and the robustness of subword modeling.
6 Conclusions
This paper presents a study on improving the quality of frame labels for unsupervised subword modeling without any out-of-domain resources. Frame labels are generated by clustering speaker-invariant features learned by FHVAEs. The speaker-invariant features are further fed as inputs to DNN-BNF training. Experiments conducted on ZeroSpeech 2017 show that our proposed approaches achieve 2.4%/0.6% absolute ABX error rate reductions in across-/within-speaker conditions, compared to the baseline without applying FHVAEs. Compared with a DNN-BNF system in which out-of-domain transcribed data are used for speaker adapted feature learning, our approaches perform slightly better in the within-speaker condition and slightly worse in the across-speaker condition. Our approaches significantly outperform VTLN in improving the quality of frame labels and the robustness of subword modeling.
7 Acknowledgements
This research is partially supported by the Major Program of National Social Science Fund of China (Ref:13&ZD189), a GRF project grant (Ref: CUHK 14227216) from Hong Kong Research Grants Council and a direct grant from CUHK Research Committee.
References
[1] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li, “Acoustic segment modeling with spectral clustering methods,” IEEE/ACM Trans. ASLP, vol. 23, no. 2, pp. 264–277, 2015.
[2] H. Li, B. Ma, and C.-H. Lee, “A vector space modeling approach to spoken language identification,” IEEE Trans. ASLP, vol. 15, no. 1, pp. 271–284, 2007.
[3] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Unsupervised bottleneck features for low-resource query-by-example spoken term detection,” in Proc. INTERSPEECH, 2016, pp. 923–927.
[4] A. Jansen, E. Dupoux, S. Goldwater, M. Johnson, S. Khudanpur, K. Church et al., “A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition,” in Proc. ICASSP, 2013, pp. 8111–8115.
[5] E. Dupoux, “Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner,” arXiv, vol. abs/1607.08723, 2016.
[6] M. Versteegh, R. Thiollière, T. Schatz, X.-N. Cao, X. Anguera, A. Jansen et al., “The zero resource speech challenge 2015,” in Proc. INTERSPEECH, 2015, pp. 3169–3173.
[7] E. Dunbar, X.-N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier et al., “The zero resource speech challenge 2017,” in Proc. ASRU, 2017, pp. 323–330.
[8] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Multitask feature learning for low-resource query-by-example spoken term detection,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1329–1339, 2017.
