Iterative pseudo-forced alignment by acoustic CTC loss for self-supervised ASR domain adaptation
Fernando López, Jordi Luque

TL;DR
This paper introduces an iterative pseudo-forced alignment method using CTC loss for self-supervised domain adaptation in ASR, enabling accurate alignment and adaptation without human annotations.
Contribution
It presents a novel iterative alignment algorithm that refines audio-text alignments using CTC posteriors, improving domain adaptation for end-to-end ASR without manual labels.
Findings
Achieves high-quality alignments on broadcast TV and voice datasets.
Enables effective domain adaptation and semi-supervised training.
Requires no human-revised references for alignment or adaptation.
Abstract
High-quality data labeling for specific domains is costly and consumes considerable human time. In this work, we propose a self-supervised domain adaptation method based on an iterative pseudo-forced alignment algorithm. The produced alignments are used to customize an end-to-end Automatic Speech Recognition (ASR) system and are iteratively refined. The algorithm is fed with frame-wise character posteriors produced by a seed ASR model, trained on out-of-domain data and optimized with a Connectionist Temporal Classification (CTC) loss. The alignments are computed iteratively on a corpus of broadcast TV. The process is repeated, either reducing the quantity of text to be aligned or expanding the alignment window, until the best possible audio-text alignment is found. The starting timestamps, or temporal anchors, are derived solely from the confidence score of the last aligned utterance. This score…
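To make the idea in the abstract concrete, here is a minimal sketch of a CTC Viterbi forced alignment over frame-wise log posteriors, wrapped in a loop that widens the alignment window until a confidence score clears a threshold. It is not the authors' implementation. The function names (`ctc_forced_align`, `iterative_align`), the mean per-frame log-probability used as the confidence score, and the parameters `window`, `step`, and `threshold` are all assumptions made for illustration. The text-reduction branch of the algorithm is omitted.

```python
import numpy as np

def ctc_forced_align(log_probs, targets, blank=0):
    """Best CTC path for `targets` over all frames of `log_probs` (shape T x V).

    Returns (path, score): one label id per frame (blanks included) and the
    mean per-frame log-probability of that path (an assumed confidence score).
    """
    T = log_probs.shape[0]
    if T == 0:
        return None, -np.inf
    # Extended label sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in targets:
        ext += [label, blank]
    S = len(ext)
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    dp[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        dp[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [(dp[t - 1, s], s)]              # stay in the same state
            if s >= 1:
                cands.append((dp[t - 1, s - 1], s - 1))  # advance one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((dp[t - 1, s - 2], s - 2))  # skip a blank
            best, prev = max(cands)
            dp[t, s] = best + log_probs[t, ext[s]]
            back[t, s] = prev
    # A valid path ends in the final blank or the final label.
    ends = [S - 1] if S == 1 else [S - 1, S - 2]
    s = max(ends, key=lambda i: dp[T - 1, i])
    if not np.isfinite(dp[T - 1, s]):
        return None, -np.inf                         # window too short for the text
    score = float(dp[T - 1, s] / T)
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t, s]
    path.reverse()
    return path, score

def iterative_align(log_probs, targets, start, window, step, threshold,
                    max_iters=5, blank=0):
    """Widen the window from the anchor `start` until the score clears `threshold`.

    Returns (path, score, end_frame) for the best-scoring window tried.
    """
    T = log_probs.shape[0]
    best = (None, -np.inf, start)
    for _ in range(max_iters):
        end = min(start + window, T)
        path, score = ctc_forced_align(log_probs[start:end], targets, blank)
        if score > best[1]:
            best = (path, score, end)
        if score >= threshold or end == T:
            break
        window += step                               # expand the alignment window
    return best

# Toy posteriors: vocabulary {0: blank, 1: 'a', 2: 'b'}, six frames.
frames = [0, 1, 1, 0, 2, 0]
toy = np.log(np.array([[0.9 if v == f else 0.05 for v in range(3)] for f in frames]))
toy_path, toy_score, toy_end = iterative_align(toy, [1, 2], start=0, window=3,
                                               step=3, threshold=-0.2)
```

In this toy run the first three-frame window cannot accommodate the label `b` with high probability, so its score falls below the threshold and the window grows to cover all six frames. In a fuller version, the end frame of an accepted alignment would presumably serve as the temporal anchor for the next utterance, as the abstract describes.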
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
