Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition
Alexandros Haliassos, Rodrigo Mira, Stavros Petridis

TL;DR
This paper introduces USR 2.0, a novel speech recognition framework that combines CTC-driven pseudo-labeling with mixed sampling to enhance training efficiency, robustness, and accuracy across multiple speech modalities and challenging conditions.
Contribution
It proposes CTC-driven teacher forcing and mixed sampling techniques, enabling faster, more robust, and more accurate speech recognition without costly beam search.
Findings
Halves training time compared to previous methods.
Improves robustness to out-of-distribution inputs.
Achieves state-of-the-art results on multiple benchmarks.
Abstract
Unified Speech Recognition (USR) has emerged as a semi-supervised framework for training a single model for audio, visual, and audiovisual speech recognition, achieving state-of-the-art results on in-distribution benchmarks. However, its reliance on autoregressive pseudo-labelling makes training expensive, while its decoupled supervision of CTC and attention branches increases susceptibility to self-reinforcing errors, particularly under distribution shifts involving longer sequences, noise, or unseen domains. We propose CTC-driven teacher forcing, where greedily decoded CTC pseudo-labels are fed into the decoder to generate attention targets in a single forward pass. Although these can be globally incoherent, in the pseudo-labelling setting they enable efficient and effective knowledge transfer. Because CTC and CTC-driven attention pseudo-labels have the same length, the decoder can…
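The CTC-driven teacher forcing described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The blank index, the start-of-sequence id, the one-position input shift, and the `toy_decoder` stand-in are all assumptions made for illustration. The sketch shows only the two steps the abstract names: greedy CTC decoding, then a single teacher-forced decoder pass whose outputs have the same length as the CTC pseudo-labels.

```python
from typing import Callable, List, Sequence

BLANK = 0  # assumed CTC blank index
SOS = 1    # assumed start-of-sequence token id (hypothetical)


def ctc_greedy_decode(frame_logits: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Take the argmax per frame, merge consecutive repeats, drop blanks."""
    tokens: List[int] = []
    prev = None
    for frame in frame_logits:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != prev and best != blank:
            tokens.append(best)
        prev = best
    return tokens


def ctc_driven_teacher_forcing(
    ctc_tokens: Sequence[int],
    decoder: Callable[[List[int]], List[List[float]]],
    sos: int = SOS,
) -> List[int]:
    """Feed the CTC tokens (shifted right by one) to the decoder in one call
    and read off one attention pseudo-label per input position."""
    if not ctc_tokens:
        return []
    decoder_inputs = [sos] + list(ctc_tokens[:-1])
    position_logits = decoder(decoder_inputs)  # single forward pass, no beam search
    return [max(range(len(row)), key=row.__getitem__) for row in position_logits]


def toy_decoder(inputs: List[int]) -> List[List[float]]:
    """Hypothetical stand-in for a teacher decoder. A real decoder would
    condition on the whole prefix and the encoder output; this one only
    looks at the current input token."""
    next_token = {SOS: 2, 2: 3, 3: 2}
    rows = []
    for tok in inputs:
        row = [0.0] * 4
        row[next_token[tok]] = 1.0
        rows.append(row)
    return rows


# Seven frames over a four-symbol vocabulary (index 0 is the blank).
frames = [
    [0.9, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.8, 0.1],
    [0.1, 0.0, 0.7, 0.2],
    [0.8, 0.0, 0.1, 0.1],
    [0.1, 0.0, 0.2, 0.7],
    [0.6, 0.0, 0.2, 0.2],
    [0.1, 0.0, 0.9, 0.0],
]
ctc_pseudo_labels = ctc_greedy_decode(frames)
attention_targets = ctc_driven_teacher_forcing(ctc_pseudo_labels, toy_decoder)
```

Because the decoder is called once on a fixed input sequence, the attention pseudo-labels come out position-aligned with the CTC pseudo-labels, which is the equal-length property the abstract refers to.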
Peer Reviews
Decision: ICLR 2026 Poster
* The authors' overall intuitions are reasonable and well-supported by design choices. USR 2.0 introduces a CTC-driven pseudo-labelling approach that effectively removes the autoregressive bottleneck in attention-based decoding, resulting in significantly faster training and improved inference efficiency.
* The model unifies ASR, VSR, and AVSR within a single architecture, and demonstrates strong robustness to long inputs, noise, and domain shifts, while maintaining competitive performance on I
* Since the attention decoder must operate autoregressively during inference, the benefit of test-time parallelism is limited, and the speedup primarily applies to the training phase.
* Although attention pseudo-labels generated from CTC-driven decoding may lack global coherence, the authors convincingly argue that this does not hinder learning during self-training, as both teacher and student are conditioned on the same token sequence. However, this could limit reuse of such pseudo-labels in n
1. CTC-driven teacher forcing is an elegant idea that leverages the stability and monotonic alignment properties of CTC to guide the attention decoder. It enables parallel pseudo-label generation without slow autoregressive decoding, thereby reducing computational cost and eliminating cascading AR errors.
2. Improved training efficiency: The paper shows a 2× reduction in training time, which is significant for multimodal setups (audio, visual, audiovisual). This is achieved without compromising
The proposed approach relies heavily on the quality of CTC-generated pseudo-labels. While the paper acknowledges that the student decoder is trained to predict the teacher’s outputs under the same CTC-driven inputs, the method still assumes that these pseudo-labels are sufficiently coherent to guide decoder learning. For challenging or noisy segments, however, CTC errors can propagate through teacher forcing since the decoder conditions directly on these imperfect sequences. Consequently, the le
1. The paper itself is clearly written and easy to follow.
2. The paper directly addresses the problems and limitations of USR.
3. The results are robust compared to baselines.
There are no obvious flaws in this paper. Only a few points require clarification (see Questions).
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
