Comparing CTC and LFMMI for out-of-domain adaptation of wav2vec 2.0 acoustic model
Apoorv Vyas, Srikanth Madikeri, Hervé Bourlard

TL;DR
This paper compares CTC and LFMMI training methods for wav2vec 2.0 models in out-of-domain and cross-lingual speech recognition, showing both methods significantly improve performance over supervised baselines, with LFMMI slightly outperforming CTC.
Contribution
It demonstrates that wav2vec 2.0 pretrained models help mitigate overfitting in CTC training and compares CTC and LFMMI effectiveness across multiple datasets and languages.
Findings
Both CTC and LFMMI achieve similar results in supervised adaptation.
Both methods yield significant relative WER reductions over supervised baselines across datasets.
LFMMI slightly outperforms CTC in most scenarios.
Abstract
In this work, we investigate whether wav2vec 2.0 self-supervised pretraining helps mitigate the overfitting issues with connectionist temporal classification (CTC) training and thereby reduces its performance gap with flat-start lattice-free MMI (E2E-LFMMI) for automatic speech recognition with limited training data. Towards that objective, we use the pretrained wav2vec 2.0 BASE model and fine-tune it on three different datasets including out-of-domain (Switchboard) and cross-lingual (Babel) scenarios. Our results show that for supervised adaptation of the wav2vec 2.0 model, both E2E-LFMMI and CTC achieve similar results, significantly outperforming the baselines trained only with supervised data. Fine-tuning the wav2vec 2.0 model with E2E-LFMMI and CTC, we obtain the following relative WER improvements over the supervised baseline trained with E2E-LFMMI. We get relative improvements of 40% and 44%…
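The abstract reports its gains as relative WER improvements over a baseline. As a minimal sketch of how such a figure is usually computed, assuming the standard definition (baseline WER minus new WER, divided by baseline WER), the helper below is illustrative only. The function name `relative_wer_reduction` and the example numbers are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Assumes the common definition: 100 * (baseline - new) / baseline.
    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only (not results from the paper):
# a baseline at 20.0% WER and a fine-tuned model at 12.0% WER.
example = relative_wer_reduction(20.0, 12.0)
print(f"relative WER reduction: {example:.1f}%")
```

Under this definition, a 40% relative improvement means the fine-tuned model's WER is 60% of the baseline's WER, whatever the absolute values are.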
