Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition

Guan-Ting Lin; Shang-Wen Li; Hung-yi Lee

arXiv:2203.14222 · eess.AS · June 22, 2022

TL;DR

This paper introduces SUTA (Single-Utterance Test-time Adaptation), a test-time adaptation framework for automatic speech recognition that adapts the model on one utterance at a time, without access to source data, and improves performance on both out-of-domain and in-domain test samples.

Contribution

To the authors' knowledge, SUTA is the first study to apply test-time adaptation to ASR. It adapts on a single utterance, so inference is not delayed by collecting a batch of adaptation data, and it requires no access to source data.

Findings

01. SUTA improves ASR accuracy on multiple out-of-domain datasets.

02. Single-utterance adaptation is effective without batch collection.

03. The method enhances in-domain test performance as well.

Abstract

Although deep learning-based end-to-end Automatic Speech Recognition (ASR) has shown remarkable performance in recent years, it suffers severe performance regression on test samples drawn from different data distributions. Test-time Adaptation (TTA), previously explored in computer vision, aims to adapt a model trained on source domains to yield better predictions for test samples, often out-of-domain, without accessing the source data. Here, we propose the Single-Utterance Test-time Adaptation (SUTA) framework for ASR, which is, to the best of our knowledge, the first TTA study on ASR. Single-utterance TTA is a more realistic setting: it does not assume that test data are sampled from an identical distribution, and it does not delay on-demand inference by first collecting a batch of adaptation data. SUTA consists of unsupervised objectives with an efficient adaptation strategy.…
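The abstract is truncated here and does not name SUTA's objectives, so the following is only a minimal sketch of the general idea of adapting on one unlabeled utterance. It assumes entropy minimization, an unsupervised objective common in vision TTA, as a stand-in for the paper's actual losses. The per-class `bias` vector is a hypothetical substitute for whatever small parameter subset a real model would update, and the random logits stand in for per-frame ASR outputs.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one frame's logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def entropy(p):
    """Shannon entropy of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def mean_entropy(logits, bias):
    """Average per-frame entropy of the predictions for one utterance."""
    rows = [softmax([z + b for z, b in zip(row, bias)]) for row in logits]
    return sum(entropy(p) for p in rows) / len(rows)


def entropy_grad_wrt_bias(logits, bias):
    """Analytic gradient of the mean entropy with respect to the shared bias.

    For one frame, dH/dz_j = -p_j * (log p_j + H).
    """
    grad = [0.0] * len(bias)
    for row in logits:
        p = softmax([z + b for z, b in zip(row, bias)])
        h = entropy(p)
        for j, pj in enumerate(p):
            if pj > 0.0:
                grad[j] += -pj * (math.log(pj) + h)
    return [g / len(logits) for g in grad]


def adapt_single_utterance(logits, steps=10, lr=0.1):
    """Adapt from scratch on one utterance; no labels, no source data.

    Assumption: parameters start fresh for every utterance (episodic),
    so nothing carries over between utterances.
    """
    bias = [0.0] * len(logits[0])
    for _ in range(steps):
        grad = entropy_grad_wrt_bias(logits, bias)
        bias = [b - lr * g for b, g in zip(bias, grad)]
    return bias


# Hypothetical utterance: 20 frames, 5 output classes.
random.seed(0)
utterance_logits = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(20)]

before = mean_entropy(utterance_logits, [0.0] * 5)
bias = adapt_single_utterance(utterance_logits)
after = mean_entropy(utterance_logits, bias)
print(f"mean frame entropy: {before:.4f} -> {after:.4f}")
```

In a real system the gradient would come from automatic differentiation through the ASR model rather than a hand-derived formula, and lower entropy alone does not guarantee a lower word error rate.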

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing