The effectiveness of unsupervised subword modeling with autoregressive and cross-lingual phone-aware networks
Siyuan Feng, Odette Scharenborg

TL;DR
This paper introduces a two-stage unsupervised subword modeling framework that combines autoregressive predictive coding (APC) with a cross-lingual DNN, and reports performance competitive with or superior to state-of-the-art methods on the ABX subword discriminability task.
Contribution
It proposes a two-stage learning framework, with APC as the self-supervised front-end and a cross-lingual DNN as the back-end, and analyzes at the phoneme and articulatory feature (AF) level what the learned subword representations capture.
Findings
Competitive with or superior to state-of-the-art systems on the ABX subword discriminability task (Libri-light and ZeroSpeech 2017 databases).
Captures diphthong vowel information better than monophthong vowel information.
Shows positive correlation between cross-lingual label quality and phoneme information capture.
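For readers unfamiliar with the ABX subword discriminability task named in the findings, the idea can be sketched as follows. This is a minimal illustration, not the official evaluation code: the function names `dtw_cosine_distance` and `abx_error_rate` are hypothetical, and normalizing the DTW cost by the summed sequence lengths is a simplifying assumption. Each triplet holds frame-level features for A, B, and X, where X belongs to the same subword category as A; an error is counted when X ends up closer to B.

```python
import numpy as np

def dtw_cosine_distance(x, y):
    """DTW-aligned cosine distance between two (frames, dims) arrays.

    Assumption: the accumulated cost is divided by len(x) + len(y),
    a simple length normalization chosen for this sketch.
    """
    # Unit-normalize each frame (guarding against zero-norm frames).
    xn = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-8)
    yn = y / np.maximum(np.linalg.norm(y, axis=1, keepdims=True), 1e-8)
    cost = 1.0 - xn @ yn.T  # frame-by-frame cosine distance
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    return acc[n, m] / (n + m)

def abx_error_rate(triplets):
    """Fraction of (A, B, X) triplets where X is not closer to A than to B.

    X is assumed to share its category with A. Ties count as half an error.
    """
    errors = 0.0
    for a, b, x in triplets:
        d_ax = dtw_cosine_distance(a, x)
        d_bx = dtw_cosine_distance(b, x)
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
    return errors / len(triplets)
```

A lower error rate means the representation separates the two subword categories better; an error rate of 0.5 corresponds to chance.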
Abstract
This study addresses unsupervised subword modeling, i.e., learning acoustic feature representations that can distinguish between subword units of a language. We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer. The framework consists of autoregressive predictive coding (APC) as the front-end and a cross-lingual deep neural network (DNN) as the back-end. Experiments on the ABX subword discriminability task conducted with the Libri-light and ZeroSpeech 2017 databases showed that our approach is competitive or superior to state-of-the-art studies. Comprehensive and systematic analyses at the phoneme- and articulatory feature (AF)-level showed that our approach was better at capturing diphthong than monophthong vowel information, while also differences in the amount of information captured for different types of consonants…
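The APC front-end mentioned in the abstract is trained to predict a future acoustic frame from past frames. The sketch below shows only the training objective as it is commonly formulated for APC (an L1 loss between the prediction made at time t and the true frame n steps ahead). The function name `apc_l1_loss` and the parameter `n_shift` are hypothetical, and the autoregressive model that produces `predictions` is left out; the configuration used in the paper is not stated in this summary.

```python
import numpy as np

def apc_l1_loss(features, predictions, n_shift):
    """Mean absolute error between predictions[t] and features[t + n_shift].

    features:    (frames, dims) array of input acoustic features.
    predictions: (frames, dims) array; row t is the model's guess for the
                 frame n_shift steps after t (model itself not shown here).
    n_shift:     how many frames ahead the model predicts (assumed >= 1).
    """
    if n_shift < 1 or n_shift >= len(features):
        raise ValueError("n_shift must be in [1, frames - 1]")
    targets = features[n_shift:]          # frames that can be predicted
    usable = predictions[:-n_shift]       # predictions that have a target
    return float(np.mean(np.abs(targets - usable)))
```

After training with such an objective, hidden representations of the model (rather than its predictions) are typically what get passed on to a later stage, which in this framework is the cross-lingual DNN back-end.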
