Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data

Pol Buitrago; Pol Gàlvez; Oriol Pareras; Javier Hernando

arXiv:2603.08249·eess.AS·March 10, 2026

TL;DR

This paper introduces a zero-AV-resource audiovisual speech recognition (AVSR) method that trains on synthetic video, generated by lip-syncing static facial images to real audio, so that under-resourced languages with no real audiovisual corpora can still benefit from AVSR.

Contribution

The paper presents a framework that synthesizes visual streams for AVSR training from static facial images and real audio, and demonstrates it on Catalan, a language with no annotated audiovisual corpora.

Findings

1. Achieves near state-of-the-art performance on a manually annotated Catalan AVSR benchmark.

2. Outperforms an identically trained audio-only baseline, including in noisy conditions.

3. Uses far fewer parameters and far less training data than state-of-the-art systems.

Abstract

Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions, but it remains out of reach for most under-resourced languages because they lack labeled video corpora for training. We propose a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. We first evaluate synthetic visual augmentation on Spanish benchmarks, then apply it to Catalan, a language with no annotated audiovisual corpora. We synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. On a manually annotated Catalan benchmark, our model achieves near state-of-the-art performance with far fewer parameters and much less training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise. Scalable…
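The data-preparation step the abstract describes, pairing real audio with static facial images before lip-sync synthesis, might be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: `LipSyncJob`, `build_jobs`, the file names, and the seeded uniform sampling of faces are all assumptions, and the lip-sync generator that would consume these jobs is not shown.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LipSyncJob:
    """One synthesis request: animate `face_image` so its lips follow `audio_path`."""
    audio_path: str
    face_image: str
    transcript: str


def build_jobs(utterances, face_images, seed=0):
    """Pair every (audio_path, transcript) utterance with a randomly drawn face image.

    Hypothetical helper: the source does not say how images are assigned to
    utterances, so uniform random sampling with a fixed seed is an assumption.
    """
    if not face_images:
        raise ValueError("at least one static face image is required")
    rng = random.Random(seed)  # seeded so the pairing is reproducible
    return [
        LipSyncJob(audio_path=audio, face_image=rng.choice(face_images), transcript=text)
        for audio, text in utterances
    ]


# Placeholder inputs; real runs would list audio-only corpus files and face photos.
utterances = [("utt_001.wav", "bon dia"), ("utt_002.wav", "moltes gràcies")]
faces = ["face_a.png", "face_b.png", "face_c.png"]
jobs = build_jobs(utterances, faces, seed=13)
```

Each resulting job keeps the original transcript, so the synthesized video would inherit its label from the audio-only corpus without any new annotation.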

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Emotion and Mood Recognition