VAE-based Phoneme Alignment Using Gradient Annealing and SSL Acoustic Features

Tomoki Koriyama

arXiv:2407.02749·eess.AS·September 26, 2024

TL;DR

This paper introduces a VAE-based phoneme alignment model that uses gradient annealing and SSL acoustic features to produce more accurate phoneme boundaries, with speech analysis and video content creation as target applications.

Contribution

The paper extends the one TTS alignment (OTA) model with a VAE architecture, gradient annealing, SSL-based acoustic features, and state-level linguistic units for unsupervised phoneme boundary detection.

Findings

01 Produced phoneme boundaries closer to the annotated ones than the conventional OTA model and the CTC-based segmentation model.

02 Outperformed the Montreal Forced Aligner (MFA) in phoneme boundary alignment.

03 Demonstrated the effectiveness of SSL features in phoneme segmentation.

Abstract

This paper presents an accurate phoneme alignment model intended for speech analysis and video content creation. We propose a variational autoencoder (VAE)-based alignment model in which a probable path is searched using encoded acoustic and linguistic embeddings in an unsupervised manner. Our proposed model is based on one TTS alignment (OTA) and extended to obtain phoneme boundaries. Specifically, we incorporate a VAE architecture to maintain consistency between the embedding and the input, apply gradient annealing to avoid local optima during training, and introduce a self-supervised learning (SSL)-based acoustic-feature input and state-level linguistic units to utilize rich and detailed information. Experimental results show that the proposed model generated phoneme boundaries closer to annotated ones compared with the conventional OTA model, the CTC-based segmentation model, and the…
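
The abstract describes searching for a probable path between acoustic and linguistic embeddings. As a rough illustration only, the sketch below runs a generic Viterbi-style monotonic alignment over a frame-by-unit score matrix and reads off the boundary frames. It is not the paper's model: the function names `monotonic_align` and `boundaries`, the toy scores, and the hard best-path search are assumptions made for this example, and the VAE, gradient annealing, and SSL feature extraction are left out entirely.

```python
from typing import List


def monotonic_align(score: List[List[float]]) -> List[int]:
    """Return the best monotonic unit index for each frame.

    score[t][k] is an assumed log-similarity between acoustic frame t and
    linguistic unit k (hypothetical input, not the paper's exact quantity).
    The path starts at unit 0, ends at the last unit, and at each frame
    either stays on the same unit or advances by exactly one.
    """
    num_frames = len(score)
    num_units = len(score[0])
    if num_frames < num_units:
        raise ValueError("need at least one frame per unit")

    neg_inf = float("-inf")
    best = [[neg_inf] * num_units for _ in range(num_frames)]
    back = [[0] * num_units for _ in range(num_frames)]
    best[0][0] = score[0][0]

    for t in range(1, num_frames):
        # Unit k is reachable at frame t only if k <= t.
        for k in range(min(t + 1, num_units)):
            stay = best[t - 1][k]
            move = best[t - 1][k - 1] if k > 0 else neg_inf
            if move > stay:
                best[t][k] = move + score[t][k]
                back[t][k] = k - 1
            else:
                best[t][k] = stay + score[t][k]
                back[t][k] = k

    # Backtrace from the last unit at the last frame.
    path = [0] * num_frames
    k = num_units - 1
    for t in range(num_frames - 1, -1, -1):
        path[t] = k
        k = back[t][k]
    return path


def boundaries(path: List[int]) -> List[int]:
    """Frame indices at which the aligned unit changes."""
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]


# Toy scores: frames 0-2 favor unit 0, frames 3-4 favor unit 1.
toy_score = [[0.0, -5.0], [0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]]
toy_path = monotonic_align(toy_score)
print(toy_path, boundaries(toy_path))
```

In an unsupervised aligner of the kind the abstract describes, the score matrix would come from learned embeddings rather than being given, and boundary frame indices would be converted to time using the feature frame shift.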

Code & Models

Repositories

hyama5/vae_align (official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques