VAE-based Phoneme Alignment Using Gradient Annealing and SSL Acoustic Features
Tomoki Koriyama

TL;DR
This paper introduces a novel VAE-based phoneme alignment model that leverages gradient annealing and SSL acoustic features to improve boundary accuracy in speech analysis and video content creation.
Contribution
It extends the OTA model with VAE architecture, gradient annealing, and SSL features for unsupervised phoneme boundary detection.
Findings
Achieved more accurate phoneme boundaries than conventional OTA and CTC models.
Outperformed MFA in phoneme boundary alignment.
Demonstrated effectiveness of SSL features in phoneme segmentation.
Abstract
This paper presents an accurate phoneme alignment model that aims for speech analysis and video content creation. We propose a variational autoencoder (VAE)-based alignment model in which a probable path is searched using encoded acoustic and linguistic embeddings in an unsupervised manner. Our proposed model is based on one TTS alignment (OTA) and extended to obtain phoneme boundaries. Specifically, we incorporate a VAE architecture to maintain consistency between the embedding and input, apply gradient annealing to avoid local optimum during training, and introduce a self-supervised learning (SSL)-based acoustic-feature input and state-level linguistic unit to utilize rich and detailed information. Experimental results show that the proposed model generated phoneme boundaries closer to annotated ones compared with the conventional OTA model, the CTC-based segmentation model, and the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques
