TL;DR
This paper introduces a lattice-free MMI adaptation method for self-supervised pretrained acoustic models, demonstrating significant WER improvements across multiple datasets and languages.
Contribution
It presents a novel LFMMI-based supervised adaptation technique for self-supervised pretrained models, improving speech recognition accuracy.
Findings
10% and 35.3% relative WER reduction on the clean and other test sets of Librispeech (100h)
10.8% relative WER reduction on Switchboard (300h)
4.3% relative WER reduction on Swahili (38h) and 4.4% on Tagalog (84h)
Abstract
In this work, we propose lattice-free MMI (LFMMI) for supervised adaptation of a self-supervised pretrained acoustic model. We pretrain a Transformer model on a thousand hours of untranscribed Librispeech data, followed by supervised adaptation with LFMMI on three different datasets. Our results show that by fine-tuning with LFMMI, we consistently obtain relative WER improvements of 10% and 35.3% on the clean and other test sets of Librispeech (100h), 10.8% on Switchboard (300h), and 4.3% on Swahili (38h) and 4.4% on Tagalog (84h) compared to the baseline trained only with supervised data.
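The improvements quoted in the abstract are relative WER reductions, that is, the drop in WER divided by the baseline WER. A minimal sketch of that arithmetic follows; the function name and the WER values are hypothetical and chosen only for illustration, they are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in percent (e.g. 20.0 for 20%).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical numbers (NOT from the paper): a baseline at 20.0% WER
# and an adapted model at 17.84% WER.
baseline, adapted = 20.0, 17.84
print(f"absolute reduction: {baseline - adapted:.2f} points")
print(f"relative reduction: {relative_wer_reduction(baseline, adapted):.1f}%")
```

Note that a relative reduction is not the same as an absolute one: in this made-up case the absolute drop is 2.16 WER points, while the relative drop is about 10.8%.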
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Attention Is All You Need · Dropout · Adam · Multi-Head Attention · Residual Connection · Byte Pair Encoding
