Bilevel Joint Unsupervised and Supervised Training for Automatic Speech Recognition

Xiaodong Cui; A F M Saif; Songtao Lu; Lisha Chen; Tianyi Chen; Brian Kingsbury; George Saon

arXiv:2412.08548·cs.CL·December 12, 2024

Open Access

TL;DR

This paper introduces BL-JUST, a bilevel training framework that jointly optimizes unsupervised and supervised objectives for speech recognition, yielding better acoustic models than the conventional two-stage pre-training and fine-tuning strategy.

Contribution

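The paper presents a bilevel joint training approach for speech recognition that minimizes the unsupervised and supervised losses simultaneously rather than in two disconnected stages. The resulting problem is solved with penalty-based bilevel gradient descent and evaluated on various datasets, architectures, and loss functions.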

Findings

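01. BL-JUST can outperform the widely used pre-training and fine-tuning strategy.

02. BL-JUST can also outperform some other popular semi-supervised techniques.

03. Because BL-JUST seeks matched local optima of both loss functions, the learned acoustic representations balance being generic and being task-specific.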

Abstract

In this paper, we propose a bilevel joint unsupervised and supervised training (BL-JUST) framework for automatic speech recognition. Compared to the conventional pre-training and fine-tuning strategy which is a disconnected two-stage process, BL-JUST tries to optimize an acoustic model such that it simultaneously minimizes both the unsupervised and supervised loss functions. Because BL-JUST seeks matched local optima of both loss functions, acoustic representations learned by the acoustic model strike a good balance between being generic and task-specific. We solve the BL-JUST problem using penalty-based bilevel gradient descent and evaluate the trained deep neural network acoustic models on various datasets with a variety of architectures and loss functions. We show that BL-JUST can outperform the widely-used pre-training and fine-tuning strategy and some other popular semi-supervised…
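To make the idea of penalty-based bilevel gradient descent concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. It assumes, for illustration only, that the supervised loss sits at the upper level and the unsupervised loss at the lower level, and that the lower-level constraint is folded into a single objective `sup_loss + gamma * unsup_loss` with a growing penalty weight `gamma`. The quadratic losses, the names `penalty_bilevel_gd`, `TARGET_SUP`, and the `gammas` schedule are all hypothetical stand-ins for real acoustic-model losses.

```python
import numpy as np

# Hypothetical toy setup: theta is a 2-D "model parameter".
TARGET_SUP = np.array([3.0, 2.0])  # optimum of the toy supervised loss (assumed)


def unsup_loss(theta):
    # Toy unsupervised loss: only constrains theta[0]; its minimizers form
    # the line theta[0] == 1, and its minimum value is 0.
    return 0.5 * (theta[0] - 1.0) ** 2


def unsup_grad(theta):
    return np.array([theta[0] - 1.0, 0.0])


def sup_loss(theta):
    # Toy supervised loss: squared distance to TARGET_SUP.
    return 0.5 * np.sum((theta - TARGET_SUP) ** 2)


def sup_grad(theta):
    return theta - TARGET_SUP


def penalty_bilevel_gd(theta0, gammas=(1.0, 10.0, 100.0, 1000.0), steps=200):
    """Gradient descent on sup_loss + gamma * unsup_loss for increasing gamma.

    Assumed penalty reformulation of:
        minimize sup_loss(theta)  subject to  theta in argmin unsup_loss.
    """
    theta = np.array(theta0, dtype=float)
    for gamma in gammas:
        # Step size chosen from the curvature of this toy quadratic objective.
        lr = 1.0 / (1.0 + gamma)
        for _ in range(steps):
            grad = sup_grad(theta) + gamma * unsup_grad(theta)
            theta = theta - lr * grad
    return theta


theta = penalty_bilevel_gd([0.0, 0.0])
print("theta:", theta)
print("unsupervised loss:", unsup_loss(theta))
print("supervised loss:", sup_loss(theta))
```

In this toy problem the constrained solution is `theta = (1, 2)`: the unsupervised loss pins the first coordinate, and the supervised loss picks the second among the unsupervised minimizers. As `gamma` grows, the iterate approaches that point, which loosely mirrors the abstract's description of seeking parameters that minimize both losses at once instead of in two separate stages.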

Taxonomy

Topics: Speech Recognition and Synthesis