AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition
Andrew Rouditchenko, Ronan Collobert, Tatiana Likhomanenko

TL;DR
This paper introduces AV-CPL, a semi-supervised learning method that uses continuous pseudo-labeling to improve audio-visual speech recognition by leveraging unlabeled data and cross-modal information.
Contribution
It presents a novel continuous pseudo-labeling approach for AVSR that eliminates the need for external models and enhances performance on visual speech recognition tasks.
Findings
Significant improvement in VSR accuracy on the LRS3 dataset.
Effective utilization of unlabeled visual speech data.
Maintains strong ASR and AVSR performance.
Abstract
Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision to learn representations for both automatic speech recognition (ASR) and visual speech recognition (VSR). We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a semi-supervised method to train an audio-visual speech recognition (AVSR) model on a combination of labeled and unlabeled videos with continuously regenerated pseudo-labels. Our models are trained for speech recognition from audio-visual inputs and can perform speech recognition using both audio and visual modalities, or only one modality. Our method uses the same audio-visual model for both supervised training and pseudo-label generation, mitigating the need for external speech recognition models to generate pseudo-labels. AV-CPL obtains significant improvements in VSR performance on the…
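The training scheme the abstract describes (one model that is both trained and used to regenerate pseudo-labels) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the "recognizer" is a hypothetical two-feature logistic regression (`ToyAVModel`), the data are synthetic scalars standing in for audio and video features, and the dropout probabilities, step counts, and the choice to pseudo-label from both modalities are arbitrary.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ToyAVModel:
    """Hypothetical stand-in for an AVSR model: logistic regression
    over one scalar 'audio' feature and one scalar 'video' feature."""

    def __init__(self):
        self.w_a, self.w_v, self.b = 0.0, 0.0, 0.0

    def prob(self, a, v):
        return sigmoid(self.w_a * a + self.w_v * v + self.b)

    def predict(self, a, v):
        return 1 if self.prob(a, v) >= 0.5 else 0

    def sgd_step(self, a, v, y, lr=0.1):
        err = self.prob(a, v) - y  # gradient of the logistic loss w.r.t. the logit
        self.w_a -= lr * err * a
        self.w_v -= lr * err * v
        self.b -= lr * err


def make_sample(rng, y):
    """Synthetic sample: both features cluster around +1 (class 1) or -1 (class 0)."""
    mean = 1.0 if y == 1 else -1.0
    return rng.gauss(mean, 0.3), rng.gauss(mean, 0.3), y


def modality_dropout(a, v, rng, p_audio_only=0.25, p_video_only=0.25):
    """Randomly zero one modality so the model also learns single-modality input.
    The probabilities are illustrative assumptions."""
    r = rng.random()
    if r < p_audio_only:
        return a, 0.0
    if r < p_audio_only + p_video_only:
        return 0.0, v
    return a, v


def train_cpl(seed=0, n_labeled=20, n_unlabeled=200,
              warmup_steps=200, cpl_steps=1000, unlabeled_ratio=0.5):
    rng = random.Random(seed)
    labeled = [make_sample(rng, i % 2) for i in range(n_labeled)]
    unlabeled = [make_sample(rng, i % 2)[:2] for i in range(n_unlabeled)]  # labels discarded
    model = ToyAVModel()

    # Phase 1: supervised training on labeled data only.
    for _ in range(warmup_steps):
        a, v, y = rng.choice(labeled)
        a, v = modality_dropout(a, v, rng)
        model.sgd_step(a, v, y)

    # Phase 2: keep training while the *current* model regenerates pseudo-labels
    # at every step (no external model, no frozen label set).
    for _ in range(cpl_steps):
        if rng.random() < unlabeled_ratio:
            a, v = rng.choice(unlabeled)
            y = model.predict(a, v)  # pseudo-label from both modalities (assumption)
        else:
            a, v, y = rng.choice(labeled)
        a, v = modality_dropout(a, v, rng)
        model.sgd_step(a, v, y)
    return model


def accuracy(model, samples, mode="av"):
    """Evaluate with both modalities ('av'), audio only ('a'), or video only ('v')."""
    correct = 0
    for a, v, y in samples:
        if mode == "a":
            v = 0.0
        elif mode == "v":
            a = 0.0
        correct += int(model.predict(a, v) == y)
    return correct / len(samples)
```

Calling `train_cpl()` and then `accuracy(model, test_samples, mode)` with `mode` set to `"av"`, `"a"`, or `"v"` mirrors the abstract's point that a single model can be evaluated on audio-visual, audio-only, or visual-only input. The sketch omits safeguards against degenerate pseudo-labels, such as the dynamic cache or EMA that the reviews mention.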
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. The literature survey is good. 2. Good ablation studies are given for the choice of tokens and the modality fusion method.
1. The novelty is very weak. As the authors mention, the work closely resembles Slim-IPL by Likhomanenko et al. and momentum pseudo-labeling by Higuchi et al., with the addition of the video modality being the only change. Applying CPL to AVSR is not novel either; the only narrow argument for novelty given by the authors is that they apply a different CPL method, one already established for audio-only ASR. This does not amount to significant originality in either idea or execution. 2. In terms of performance, w
Clearly demonstrates how CPL can be used for AVSR with exhaustive experiments comparing against the literature with supervised, semi-supervised, self-training results. The method could be considered simpler or more hermetic in that only a single model is developed and used to generate the pseudo-labels compared to other semi-supervised results using external models with unknown provenance. They describe difficulties with using pre-trained AV models trained from scratch and suggest using a pre-tr
Overall, the results seem incremental compared to the introduction of CPL for audio-only models, with experiments run only on a speech recognition task. I'm of the opinion that this unfortunately greatly limits the strength of the paper's contribution. There still seem to be some issues with training large models with CPL. One would expect the "Large" CPL-trained model to work better than the "Base" one, but the best final AVSR and ASR-only numbers are for Base. Finally, I would argue th
The general structure is clear. The method is simple in general. It’s easy to follow.
The main focus of this work is the continuous pseudo-labeling strategy used to bring unlabeled data into the learning process, but the specific way this core idea is implemented is similar to previous audio-based CPL works (2021). One important point, how to prevent the model from degenerating, is also handled as in previous works, i.e., with a dynamic cache or EMA. The components involved are existing ones; they are only applied to a new task, audio-visual speech recognition, instead of audio-only rec
The method is capable of performing ASR, VSR, and AVSR using a single model, without the need for external ASR models. Additionally, the method is effective for using unlabeled audio-only and visual-only data. The paper is well-written, and the authors have conducted a thorough investigation of the training configuration, including architectural design, input stride, and output token set.
The model's performance lags behind several existing works in different settings, whether using unified or task-specific models. For instance, in the LRS3 433h regime (as shown in Table 3), the method significantly underperforms the state-of-the-art (VSR WER: 45.3 vs. 19.1, AVSR WER: 3.4 vs. 0.9). The model also demonstrates limited scalability, as can be seen from the marginal improvement from the Base to Large versions. Its advantage over SSL methods is also unclear.
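The figures quoted in this review are word error rates (WER). As a reminder of what the metric measures, here is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not the paper's scoring script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1      # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sat on mat")` counts one deletion against six reference words, giving 1/6, or about 16.7% WER. Reported WERs are normally aggregated over a whole test set by dividing total errors by total reference words.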
1. Detailed experiments and an ablation study. 2. Combining audio and video representations can outperform a single ASR in some scenarios.
The contribution of this paper is not very clear to me: 1. Even though the authors claim that AV-CPL performs {AV, A, V}SR with a single model (compared to SSL models fine-tuned separately for each task), it actually exhibits significant variations in performance across these tasks due to different training strategies (e.g., modality dropout probability, PL stage). 2. Considering that VSR is a less common and more challenging task than ASR, I have some reservations about the necessity of a 3-in-
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Indoor and Outdoor Localization Technologies
