VisG AV-HuBERT: Viseme-Guided AV-HuBERT

Aristeidis Papadopoulos; Rishabh Jain; Naomi Harte

arXiv:2604.00982 · eess.AS · April 2, 2026

TL;DR

VisG AV-HuBERT introduces a multi-task framework incorporating viseme classification to improve audiovisual speech recognition, especially under noisy conditions, by explicitly guiding the encoder to focus on visual speech features.

Contribution

The paper presents a novel viseme-guided fine-tuning approach for AV-HuBERT that enhances visual feature encoding and noise robustness in AVSR systems.
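As a rough illustration of what a multi-task objective with an auxiliary viseme classifier could look like, the sketch below adds a weighted frame-level viseme cross-entropy to a main recognition loss. This is a minimal sketch under stated assumptions, not the authors' implementation: the weighted-sum form, the function names `cross_entropy` and `multitask_loss`, and the `viseme_weight` value are all hypothetical and do not come from the paper.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one frame: -log softmax(logits)[target]."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def multitask_loss(asr_loss, viseme_logits, viseme_targets, viseme_weight=0.1):
    """Hypothetical objective: ASR loss plus a weighted mean viseme loss.

    asr_loss       -- scalar loss from the main recognition task
    viseme_logits  -- one list of class scores per video frame
    viseme_targets -- one viseme class index per video frame
    viseme_weight  -- assumed trade-off weight (not taken from the paper)
    """
    frame_losses = [
        cross_entropy(logits, target)
        for logits, target in zip(viseme_logits, viseme_targets)
    ]
    viseme_loss = sum(frame_losses) / len(frame_losses)
    return asr_loss + viseme_weight * viseme_loss, viseme_loss


# Two frames, three made-up viseme classes.
total, aux = multitask_loss(
    asr_loss=2.0,
    viseme_logits=[[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]],
    viseme_targets=[0, 2],
)
print(f"viseme loss = {aux:.3f}, total loss = {total:.3f}")
```

In this sketch, setting `viseme_weight` to zero recovers the recognition loss alone, so the auxiliary term acts purely as an additional training signal on the encoder.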

Findings

01. Achieves a 51.4% relative WER reduction at -10 dB SNR on LRS3 compared to the baseline AV-HuBERT (13.59% to 6.60%).

02. Improves speech unit discrimination and reduces substitution errors across noise types.

03. Demonstrates generalization on the LRS2 dataset.
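The 51.4% figure in the first finding can be checked against the two WER values quoted in the abstract (13.59% for the baseline and 6.60% for VisG AV-HuBERT at -10 dB SNR). The short sketch below does that arithmetic; the function name `relative_wer_reduction` is a hypothetical helper used only for illustration.

```python
def relative_wer_reduction(baseline_wer, improved_wer):
    """Relative WER reduction in percent: (baseline - improved) / baseline."""
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer


# WER values at -10 dB SNR on LRS3, as quoted in the abstract.
reduction = relative_wer_reduction(13.59, 6.60)
print(f"{reduction:.1f}% relative WER reduction")  # prints "51.4% relative WER reduction"
```

The absolute reduction is 6.99 percentage points; the relative figure expresses that gain as a share of the baseline error.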

Abstract

Audio-Visual Speech Recognition (AVSR) systems nowadays integrate Large Language Model (LLM) decoders with transformer-based encoders, achieving state-of-the-art results. However, the relative contributions of improved language modelling versus enhanced audiovisual encoding remain unclear. We propose Viseme-Guided AV-HuBERT (VisG AV-HuBERT), a multi-task fine-tuning framework that incorporates auxiliary viseme classification to strengthen the model's reliance on visual articulatory features. By extending AV-HuBERT with a lightweight viseme prediction sub-network, this method explicitly guides the encoder to preserve visual speech information. Evaluated on LRS3, VisG AV-HuBERT achieves comparable or improved performance over the baseline AV-HuBERT, with notable gains under heavy noise conditions. WER reduces from 13.59% to 6.60% (51.4% relative improvement) at -10 dB Signal-to-Noise…
