Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

Umberto Cappellazzo; Stavros Petridis; Maja Pantic

arXiv:2603.12046·eess.AS·March 13, 2026

Open Access

TL;DR

This paper introduces Dr. SHAP-AV, a Shapley value-based framework for analyzing how the audio and visual modalities contribute to audio-visual speech recognition models. It reveals a persistent audio bias and shows how the modality balance shifts under noise.

Contribution

The paper presents a Shapley value-based framework for analyzing modality contributions in AVSR. It introduces three analyses (Global SHAP, Generative SHAP, and Temporal Alignment SHAP) and applies them to six models across two benchmarks at varying SNR levels.

Findings

01. Models shift toward visual reliance under noise.

02. Audio contributions remain high even with severe degradation.

03. Modality balance changes during decoding and is driven mainly by SNR.

Abstract

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. Through experiments on six models across two benchmarks and varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating…
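
To make the idea of a Shapley-based modality contribution concrete, the following is a minimal sketch of exact Shapley values for a two-player game in which the players are the audio and video streams. It is an illustration under stated assumptions, not the estimator used in the paper: the function name `shapley_values`, the player names, and every score in `toy_scores` are hypothetical, and the scores stand in for some recognition quality measure obtained with a given subset of modalities available.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a small cooperative game.

    players: list of hashable player names.
    value:   callable mapping a frozenset of players to a float score.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that excludes player p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal gain from adding p to coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical scores for each subset of modalities (made-up numbers,
# not results from the paper).
toy_scores = {
    frozenset(): 0.0,
    frozenset({"audio"}): 0.6,
    frozenset({"video"}): 0.3,
    frozenset({"audio", "video"}): 0.8,
}

phi = shapley_values(["audio", "video"], toy_scores.__getitem__)

# Relative contribution of each modality, normalized to sum to 1.
total = sum(abs(v) for v in phi.values())
shares = {m: abs(v) / total for m, v in phi.items()}
```

With two players, each Shapley value is the average of the modality's marginal gain when added first and when added second, and the two values sum to the difference between the full-input score and the empty-input score. Normalizing absolute values into shares is one simple way to express a modality balance; whether the paper aggregates contributions in this way is an assumption here.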

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation