Fine-tuning of Pre-trained End-to-end Speech Recognition with Generative Adversarial Networks
Md Akmal Haidar, Mehdi Rezagholizadeh

TL;DR
This paper proposes a novel adversarial fine-tuning method for pre-trained end-to-end speech recognition models using GANs, improving performance on large datasets like LibriSpeech.
Contribution
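It introduces a GAN-based fine-tuning framework in which a pre-trained ASR model acts as the generator. Starting from a pre-trained model avoids the slow, high-variance convergence expected when training with a GAN from scratch, and improves recognition accuracy on large corpora.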
Findings
Outperforms baseline models on the LibriSpeech dataset
Demonstrates effective adversarial fine-tuning of pre-trained ASR models
Shows improved robustness and accuracy in speech recognition
Abstract
Adversarial training of end-to-end (E2E) ASR systems using generative adversarial networks (GANs) has recently been explored for low-resource ASR corpora. GANs help to learn the true data representation through a two-player min-max game. However, training an E2E ASR model on a large ASR corpus within a GAN framework has never been explored, because it might take an excessively long time due to high-variance gradient updates and could face convergence issues. In this paper, we introduce a novel framework for fine-tuning a pre-trained ASR model using the GAN objective, where the ASR model acts as a generator and a discriminator tries to distinguish the ASR output from the real data. Since the ASR model is pre-trained, we hypothesize that the ASR model output (soft distribution vectors) helps to get higher scores from the discriminator and makes the task of the discriminator harder within our GAN…
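The generator/discriminator roles described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the loss is the standard non-saturating GAN loss, the "real" data is assumed to be one-hot encoded reference transcripts, and `toy_discriminator` is a hypothetical stand-in that scores a sequence by how peaked its per-token distributions are. The paper's actual objective and architectures may differ.

```python
import math


def one_hot(token_ids, vocab_size):
    """Encode a reference transcript as one-hot vectors (assumed 'real' data)."""
    return [[1.0 if v == t else 0.0 for v in range(vocab_size)] for t in token_ids]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_discriminator(seq, weight=4.0, bias=-2.0):
    """Hypothetical stand-in discriminator (not the paper's model).

    Returns a probability that `seq` is real, based only on the mean
    peak probability of its per-token distribution vectors.
    """
    mean_peak = sum(max(vec) for vec in seq) / len(seq)
    return sigmoid(weight * mean_peak + bias)


def discriminator_loss(d_real, d_fake):
    """Standard GAN discriminator loss: -[log D(real) + log(1 - D(fake))]."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Non-saturating generator loss: -log D(fake)."""
    return -math.log(d_fake)


# Reference transcript (real) versus two made-up ASR outputs (fake).
real = one_hot([2, 0, 1], vocab_size=3)
pretrained_out = [[0.05, 0.05, 0.90], [0.80, 0.10, 0.10], [0.10, 0.85, 0.05]]
untrained_out = [[1 / 3, 1 / 3, 1 / 3]] * 3  # uniform, as from a random model

d_real = toy_discriminator(real)
d_pre = toy_discriminator(pretrained_out)
d_rand = toy_discriminator(untrained_out)

print("D(real)       =", round(d_real, 3))
print("D(pretrained) =", round(d_pre, 3))
print("D(untrained)  =", round(d_rand, 3))
print("D loss vs pretrained:", round(discriminator_loss(d_real, d_pre), 3))
print("D loss vs untrained: ", round(discriminator_loss(d_real, d_rand), 3))
```

Under these toy assumptions, the peaked output of a pre-trained model scores closer to the real one-hot data than a uniform output does, so the discriminator's loss is higher. This is only meant to make the abstract's hypothesis concrete.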
