Revealing the Hidden Temporal Structure of HubertSoft Embeddings based on the Russian Phonetic Corpus
Anastasia Ananeva, Anton Tomilov, Marina Volkova

TL;DR
This paper investigates whether HubertSoft embeddings encode temporal and phonetic boundary information. It finds that embeddings at segment boundaries capture phoneme identity, phoneme order, and articulatory detail, which clarifies what SSL speech representations contain.
Contribution
It demonstrates that HubertSoft embeddings encode temporal structure and phoneme transitions, providing insights into their internal phonological information content.
Findings
Embeddings at phoneme boundaries encode phoneme identity and order.
High accuracy in predicting boundary positions indicates temporal sensitivity.
Embeddings reflect articulatory and coarticulatory features.
Abstract
Self-supervised learning (SSL) models such as Wav2Vec 2.0 and HuBERT have shown remarkable success in extracting phonetic information from raw audio without labelled data. While prior work has demonstrated that SSL embeddings encode phonetic features at the frame level, it remains unclear whether these models preserve temporal structure, specifically, whether embeddings at phoneme boundaries reflect the identity and order of adjacent phonemes. This study investigates the extent to which boundary-sensitive embeddings from HubertSoft, a soft-clustering variant of HuBERT, encode phoneme transitions. Using the CORPRES Russian speech corpus, we labelled 20 ms embedding windows with triplets of phonemes corresponding to their start, centre, and end segments. A neural network was trained to predict these positions separately, and multiple evaluation metrics, such as ordered, unordered accuracy…
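The labelling and scoring described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes that phoneme segments are given as (start_ms, end_ms, label) tuples, that a window's triplet is read at its first, middle, and last millisecond, that "ordered" accuracy means a position-wise exact match of the triplet, and that "unordered" accuracy means a match regardless of position. All function names and the toy segments are hypothetical.

```python
from collections import Counter

WINDOW_MS = 20  # 20 ms embedding window, as stated in the abstract


def phoneme_at(segments, t_ms):
    """Return the label of the segment covering time t_ms, or None if uncovered."""
    for start, end, label in segments:
        if start <= t_ms < end:
            return label
    return None


def window_triplet(segments, t0_ms, width_ms=WINDOW_MS):
    """Label one window with the phonemes at its start, centre, and end (assumed scheme)."""
    return (
        phoneme_at(segments, t0_ms),
        phoneme_at(segments, t0_ms + width_ms // 2),
        phoneme_at(segments, t0_ms + width_ms),
    )


def ordered_accuracy(predicted, gold):
    """Fraction of windows whose predicted triplet matches position by position."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def unordered_accuracy(predicted, gold):
    """Fraction of windows whose predicted phonemes match when order is ignored."""
    return sum(Counter(p) == Counter(g) for p, g in zip(predicted, gold)) / len(gold)


# Toy alignment (hypothetical): three phonemes with boundaries at 50 ms and 120 ms.
segments = [(0, 50, "k"), (50, 120, "a"), (120, 200, "t")]
gold = [window_triplet(segments, 40), window_triplet(segments, 110)]

# A hypothetical prediction that swaps two positions in the first window.
predicted = [("a", "k", "a"), ("a", "t", "t")]
ordered = ordered_accuracy(predicted, gold)
unordered = unordered_accuracy(predicted, gold)
```

In this toy case the first prediction contains the right phonemes in the wrong order, so it counts toward the unordered score but not the ordered one. A gap between the two scores is one way to separate knowledge of phoneme identity from knowledge of phoneme order.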
Taxonomy
Topics: Phonetics and Phonology Research · Speech Recognition and Synthesis · Neuroscience and Music Perception
