Sequence-level self-learning with multiple hypotheses

Kenichi Kumatani; Dimitrios Dimitriadis; Yashesh Gaur; Robert Gmyr; Sefik Emre Eskimez; Jinyu Li; Michael Zeng

arXiv:2112.05826·cs.CL·December 23, 2021

TL;DR

This paper introduces a novel sequence-level self-learning approach using multiple hypotheses within a multi-task learning framework to improve speech recognition, especially in accent adaptation and federated learning scenarios.

Contribution

It proposes a new multi-hypothesis self-learning method for seq2seq ASR models that mitigates errors from imperfect hypotheses and enhances adaptation.

Findings

01. Reduced word error rate (WER) from 14.55% to 10.36% in accent adaptation.

02. Effective in federated learning scenarios.

03. Improves robustness against hard-decision errors.
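To put the accent-adaptation figures above in perspective, the short sketch below computes the absolute and relative WER reduction from the two reported numbers. The function name `relative_wer_reduction` is a hypothetical helper introduced here for illustration; only the 14.55% and 10.36% values come from the page.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline WER removed by adaptation (hypothetical helper)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# WER values (in percent) as reported in the findings.
baseline, adapted = 14.55, 10.36

absolute = baseline - adapted
relative = relative_wer_reduction(baseline, adapted)
print(f"absolute: {absolute:.2f} points, relative: {relative:.1%}")
# prints "absolute: 4.19 points, relative: 28.8%"
```

In other words, the reported change is about 4.19 absolute percentage points, or roughly a 28.8% relative reduction from the baseline.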

Abstract

In this work, we develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR). For untranscribed speech data, the hypothesis from an ASR system must be used as a label. However, imperfect ASR results make it difficult for unsupervised learning to improve recognition performance consistently, especially when multiple powerful teacher models are unavailable. In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework, where the $n$-th best ASR hypothesis is used as the label of each task. The seq2seq network is updated through the MTL framework so as to find a common representation that can cover multiple hypotheses. By doing so, the effect of hard-decision errors can be alleviated. We first demonstrate the effectiveness of our…
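The idea of treating each of the n-best hypotheses as the label of its own task can be illustrated with a minimal sketch. The abstract does not give the exact loss, so everything below is an assumption made for illustration: each task loss is taken to be the negative log-likelihood of one hypothesis under teacher forcing, the tasks are combined by a weighted sum, and the weights default to uniform. The names `sequence_nll`, `multi_hypothesis_loss`, and `task_weights` are hypothetical and do not come from the paper.

```python
import math
from typing import List, Optional, Sequence


def sequence_nll(token_log_probs: Sequence[float]) -> float:
    """Negative log-likelihood of one hypothesis: minus the sum of the
    log-probabilities the model assigns to its tokens."""
    return -sum(token_log_probs)


def multi_hypothesis_loss(
    nbest_token_log_probs: List[Sequence[float]],
    task_weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of per-hypothesis losses, one 'task' per n-best entry.

    ASSUMPTION: uniform weights and a plain weighted sum; the paper's
    actual weighting and loss may differ.
    """
    n = len(nbest_token_log_probs)
    if n == 0:
        raise ValueError("need at least one hypothesis")
    if task_weights is None:
        task_weights = [1.0 / n] * n
    if len(task_weights) != n:
        raise ValueError("one weight per hypothesis is required")
    return sum(
        w * sequence_nll(lp)
        for w, lp in zip(task_weights, nbest_token_log_probs)
    )


# Toy example: token probabilities the model assigns to two hypotheses.
first_best = [math.log(0.9), math.log(0.8)]
second_best = [math.log(0.5), math.log(0.5)]

hard_label_loss = multi_hypothesis_loss([first_best])           # 1-best only
mtl_loss = multi_hypothesis_loss([first_best, second_best])     # 2-best
print(hard_label_loss, mtl_loss)
```

With a single hypothesis the function reduces to ordinary pseudo-label training on the 1-best output. Adding further hypotheses spreads the training signal across several candidate transcriptions, which is one way to read the abstract's claim that hard-decision errors are alleviated.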

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling

Methods: Self-Learning · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence