A Study on Speech Assessment with Visual Cues

Shafique Ahmed; Ryandhimas E. Zezario; Nasir Saleem; Amir Hussain; Hsin-Min Wang; Yu Tsao

arXiv:2506.09549 · eess.AS · June 12, 2025

TL;DR

This paper introduces a multimodal framework that combines audio and visual cues to predict speech quality (PESQ) and intelligibility (STOI) without a clean reference signal, outperforming an audio-only baseline on LRS3-TED speech mixed with DEMAND noise.

Contribution

It presents a novel dual-branch architecture that fuses spectral audio features with visual embeddings for improved speech assessment accuracy.

Findings

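01. Outperforms the audio-only baseline on LRS3-TED speech augmented with DEMAND noise

02. Improves PESQ LCC by 9.61% relative (0.8397 to 0.9205) under seen noise conditions

03. Improves STOI LCC by 11.47% relative (0.7403 to 0.8253) under seen noise conditions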
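
The percentages above are relative gains in LCC, and they can be checked against the before and after values quoted in the abstract. A minimal Python sketch of that arithmetic follows. The helper name `relative_gain_percent` is hypothetical. Recomputing from the four-decimal LCC values lands within about 0.02 percentage points of the reported figures, presumably because the reported gains were derived from unrounded scores.

```python
def relative_gain_percent(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# LCC values quoted in the abstract (seen noise conditions).
pesq_gain = relative_gain_percent(0.8397, 0.9205)
stoi_gain = relative_gain_percent(0.7403, 0.8253)

print(f"PESQ LCC relative gain: {pesq_gain:.2f}%")
print(f"STOI LCC relative gain: {stoi_gain:.2f}%")
```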

Abstract

Non-intrusive assessment of speech quality and intelligibility is essential when clean reference signals are unavailable. In this work, we propose a multimodal framework that integrates audio features and visual cues to predict PESQ and STOI scores. It employs a dual-branch architecture, where spectral features are extracted using STFT, and visual embeddings are obtained via a visual encoder. These features are then fused and processed by a CNN-BLSTM with attention, followed by multi-task learning to simultaneously predict PESQ and STOI. Evaluations on the LRS3-TED dataset, augmented with noise from the DEMAND corpus, show that our model outperforms the audio-only baseline. Under seen noise conditions, it improves LCC by 9.61% (0.8397->0.9205) for PESQ and 11.47% (0.7403->0.8253) for STOI. These results highlight the effectiveness of incorporating visual cues in enhancing the accuracy…
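
The LCC figures in the abstract measure how closely predicted scores track the reference PESQ and STOI values. Assuming LCC denotes the linear (Pearson) correlation coefficient, as is common in speech assessment work, a minimal pure-Python sketch is given below. The function name `lcc` and the toy scores are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import Sequence


def lcc(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Linear (Pearson) correlation between predicted and reference scores."""
    if len(predicted) != len(target) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    # Sums of centered cross-products and squared deviations.
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_t)


# Toy example with made-up numbers: reference PESQ scores vs. model predictions.
true_pesq = [1.2, 2.0, 2.7, 3.4, 4.1]
pred_pesq = [1.5, 1.9, 2.9, 3.1, 4.0]
print(f"LCC on toy data: {lcc(pred_pesq, true_pesq):.4f}")
```

Because the coefficient is invariant to shifts and positive rescaling of either sequence, it captures how well predictions follow the trend of the reference scores rather than their absolute error.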

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Image and Video Quality Assessment

Methods: STFT · CNN-BLSTM with attention · Multi-task learning